Understanding the Adaptivity Barrier in Batched Nonparametric Bandits: Why Unknown Margin Increases Sample Costs
In the realm of machine learning, particularly in reinforcement learning and online decision-making, bandit algorithms are crucial for optimizing choices under uncertainty. When dealing with batched nonparametric bandits, a specific set of challenges arises that can significantly inflate the cost of learning. This article delves into the adaptivity barrier inherent in these systems, focusing on why an unknown margin between the best and second-best options inflates sample requirements.
Problem Framing: Batched Nonparametric Bandits
We consider a K-armed nonparametric bandit over a horizon of T rounds partitioned into B batches of sizes n_1, …, n_B. Within batch b, the algorithm commits to all n_b arm pulls in advance, and each pull yields a noisy reward from an unknown, nonparametric mean function. The core difficulty is limited adaptivity: the policy may update its strategy only between batches, using data gathered from earlier batches. Feedback within a batch is delayed, so decisions for the current batch cannot be informed by results from within that same batch. This inherent delay creates an adaptivity barrier, slowing down convergence and increasing the overall sample cost.
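The batch-update constraint can be sketched in a few lines of Python. The simulator, policy, and environment names here are illustrative stand-ins, not part of any particular library:

```python
import random

def run_batched_bandit(policy, env, batch_sizes, seed=0):
    """Simulate the batched protocol: all pulls in a batch are chosen
    in advance; the policy sees rewards only between batches."""
    rng = random.Random(seed)
    history = []  # data D_{b-1} available when planning batch b
    for n_b in batch_sizes:
        arms = policy(history, n_b)          # uses only past batches' data
        rewards = [env(a, rng) for a in arms]
        history.extend(zip(arms, rewards))   # feedback arrives at batch end
    return history

# Toy instance: two arms, simple round-robin policy.
env = lambda a, rng: (0.6 if a == 1 else 0.4) + rng.gauss(0.0, 0.1)
policy = lambda hist, n: [i % 2 for i in range(n)]
data = run_batched_bandit(policy, env, batch_sizes=[20, 20, 20])
```

The key point is structural: `policy` receives only `history` from completed batches, never partial feedback from the batch it is planning.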
The Challenge of Unknown Margins
A critical factor contributing to this barrier is the unknown margin, denoted by Δ. This margin is the difference between the reward of the best arm and that of the second-best arm for a given context. When algorithms assume a fixed or known margin, they may misallocate samples, particularly when the actual margin is small or uncertain. This uncertainty about the gap necessitates more exploration, directly leading to higher sample costs. In worst-case scenarios, the sample cost of separating two arms scales roughly as 1/Δ², and batching adds further logarithmic factors because confidence levels must be set conservatively between the few update points.
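To see the 1/Δ² scaling concretely, a standard two-arm concentration argument gives a per-arm sample count of roughly 2σ² log(1/δ) / Δ²; the constant 2 here is illustrative, since exact constants depend on the analysis:

```python
import math

def pulls_needed(delta_gap, sigma=1.0, conf=0.05):
    """Rough per-arm sample count to separate two Gaussian arms whose
    means differ by delta_gap, with failure probability conf:
    n ~ 2 * sigma^2 * log(1/conf) / delta_gap^2 (constants vary)."""
    return math.ceil(2 * sigma**2 * math.log(1 / conf) / delta_gap**2)

# Halving the margin roughly quadruples the sample cost.
n_large_gap = pulls_needed(0.2)   # 150
n_small_gap = pulls_needed(0.1)   # 600
```

When Δ is unknown, the learner cannot budget this n in advance, which is exactly why margin uncertainty forces conservative over-sampling.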
Context, Interference, and Practical Payoffs
The problem is further complicated by contextual features, where arm rewards depend on an observed context x. In networked environments, actions can also lead to interference, where one arm’s reward is affected by other arms’ actions. Adaptivity in such systems must account for these interactions to avoid biased estimates and inflated costs. A practical solution lies in a margin-robust, batch-aware nonparametric approach that explicitly models batching and margin uncertainty to achieve reliable convergence with controlled sample costs.
Key Definitions and Setup
| Term | Definition | Formula / Note |
|---|---|---|
| Batching | Time is partitioned into B batches. In batch b, the policy chooses a set of arms to pull using only data from batches 1 to b−1. All pulls in batch b are completed before updating for batch b+1. | N = ∑_{b=1}^{B} n_b; decisions A_b in batch b are determined by D_{b−1}. |
| Adaptivity | The policy uses observed data up to batch b−1 to adjust actions in batch b. There is no adaptation within a single batch. | Decision rule: A_b = π(D_{b−1}); D_b := D_{b−1} ∪ {rewards in batch b}. |
| Context | Each time step is associated with a context x drawn from a distribution p_X. The reward of each arm can depend on x. | R_t(a) ∼ r(a; x_t) + noise; x_t ∼ p_X. |
| Margin Δ | The difficulty of distinguishing the best arm from the second-best for a given context. Δ can be context-specific or take a worst-case form across contexts. | Δ(x) = r*(x) − max_{a ≠ a*(x)} r(a; x). Global margin: Δ_min = inf_x Δ(x), or a typical value Δ̄. |
| Sample complexity | The amount of data (pulls) required to identify the best action with high confidence or to reach a target regret. In batched, nonparametric settings, this depends on Δ, the context distribution, and smoothness. | N(δ, Δ_min) ≈ O( log(1/δ) / Δ_min² ) per hard context; global bounds scale with context complexity and nonparametric smoothness. |
Note: While the table uses standard bandit notation, in practice Δ_min is often unknown and context-dependent. Algorithms typically estimate margins on the fly and adjust batch sizes accordingly.
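One simple way to estimate the margin on the fly is a plug-in gap between the two largest estimated arm means at a context. This helper is a hypothetical sketch of that idea, not a named routine from the literature:

```python
def estimated_margin(mean_estimates):
    """Plug-in margin estimate: gap between the best and second-best
    estimated arm means at a given context."""
    if len(mean_estimates) < 2:
        raise ValueError("need at least two arms")
    top, second = sorted(mean_estimates, reverse=True)[:2]
    return top - second

# e.g. estimated means for K = 4 arms at some context x
gap = estimated_margin([0.9, 0.7, 0.85, 0.3])  # ~ 0.05
```

In practice such estimates are themselves noisy, so confidence-bound widths must account for estimation error in the gap as well.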
Experimental Setup and Example
Our setup involves K arms and contextual features x ∈ ℝ^d drawn from p_X. Rewards are modeled as R(a; x) = μ_a(x) + noise, where μ_a(·) encodes nonparametric reward structure and the noise is N(0, σ²). We employ batching across B batches with specified sizes n_b; decisions for batch b rely only on data from prior batches. Our goal is to minimize cumulative regret or identify the best action with high confidence, quantifying the sample cost driven by uncertainty about Δ_min.
Example: Consider K = 4 arms, d = 2 context features, p_X uniform over the unit square, and smooth but unknown nonparametric functions μ_a(x). We might use B = 5 batches with equal sizes n_b = 20, reporting performance as a function of B, N, and measured regret R_T.
The Margin-Robust Batched Nonparametric UCB (MR-BNPUCB) Algorithm
To address these challenges, we introduce the Margin-Robust Batched Nonparametric UCB (MR-BNPUCB) algorithm. This algorithm pairs a nonparametric mean estimator with a principled confidence bound, designed for batched experiments where updates occur after each batch.
Algorithm Steps:
- Initialization: Define the domain X, set hyperparameters (δ, B, T, kernel, bandwidth h, regularization ε). Collect an initial batch D_0.
- Nonparametric fit: Fit a nonparametric estimator f̂_b(·) (e.g., Nadaraya–Watson) to the accumulated data D_{b−1}. Compute a pointwise uncertainty proxy s_b(·).
- Batch selection (for b = 1 to B): Construct a candidate set C_b ⊂ X. For each x ∈ C_b, compute the upper confidence bound U_b(x) = f̂_b(x) + β_b · s_b(x). Select the batch S_b of n_b points with the largest U_b(x) values.
- Update: Query responses for x ∈ S_b and set D_b = D_{b−1} ∪ {(x, y): x ∈ S_b}. Monitor performance and adjust if needed.
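The steps above can be sketched as follows. This is a minimal 1-D illustration assuming a Gaussian-kernel Nadaraya–Watson estimator and an effective-sample-size uncertainty proxy; the function names and the particular form of s(·) are our own choices, not a canonical implementation:

```python
import math

def nw_fit(data, h=0.3):
    """Nadaraya-Watson mean estimate f_hat(x) plus an uncertainty proxy
    s(x) derived from the local kernel mass (a common heuristic)."""
    def predict(x):
        weights = [math.exp(-((x - xi) ** 2) / (2 * h * h)) for xi, _ in data]
        total = sum(weights)
        if total < 1e-12:            # no nearby data: maximal uncertainty
            return 0.0, 1.0
        f_hat = sum(w * yi for w, (_, yi) in zip(weights, data)) / total
        s = 1.0 / math.sqrt(total)   # shrinks as local sample mass grows
        return f_hat, s
    return predict

def select_batch(predict, candidates, batch_size, beta=1.0):
    """Pick the batch_size candidates with the largest UCB
    U(x) = f_hat(x) + beta * s(x)."""
    scored = [(f + beta * s, x) for x in candidates for f, s in [predict(x)]]
    scored.sort(reverse=True)
    return [x for _, x in scored[:batch_size]]

# One batch round on toy 1-D data (all values here are illustrative).
data = [(0.1, 0.2), (0.5, 1.0), (0.9, 0.3)]
predict = nw_fit(data)
grid = [i / 10 for i in range(11)]
batch = select_batch(predict, grid, batch_size=3)
```

After querying the selected batch, the new observations are appended to `data` and the estimator is refit, matching the batch-level update rule D_b = D_{b−1} ∪ {new observations}.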
Modular Code Scaffolding:
- Data generator: Provides X samples and optional ground-truth functions.
- Nonparametric estimator: Implements the mean estimator f̂(·) with a kernel smoother or other models.
- Confidence bound (uncertainty) module: Computes the pointwise uncertainty proxy s(·).
- Batch scheduler: Selects the batch of points with the largest UCB values U(x).
- Evaluator: Tracks metrics like cumulative regret.
Concrete Hyperparameters:
| Parameter | Guidance / Typical values |
|---|---|
| Kernel | Gaussian (default), Epanechnikov |
| Bandwidth / lengthscale (h) | Controls smoothing; e.g., h ∈ [0.1, 1.0] for 1D |
| Regularization ε | 1e-6 to 1e-3 |
| Confidence level δ | 0.05–0.2 (commonly 0.1) |
| Batch size n_b | 5–20 |
| Rounds T | 20–200 |
Experimental Design and Reproducibility
To rigorously study these algorithms, we employ a synthetic data generator that allows for flexible nonparametric reward functions and tunable noise levels. This setup enables us to isolate the impact of factors like batch size, margin uncertainty, and interference.
Synthetic Data Generator:
- Context generator: x_t drawn from N(0, I_d) or uniformly from a unit cube.
- Arm reward functions: Nonparametric functions μ_a(x) defined via basis expansions (e.g., random Fourier features, splines).
- Noise: ε_t ∼ N(0, σ²), adjusted for the desired signal-to-noise ratio (SNR).
- Unknown margin: Δ(x) is hidden from the learner, allowing study of how sample cost grows as Δ(x) shrinks.
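A minimal generator matching this spec might look as follows. Two arms and a sine base function are illustrative simplifications of the basis-expansion setup; the point is that the margin is exactly controlled by the experimenter yet hidden from the learner:

```python
import math
import random

def make_generator(delta, sigma=0.5, seed=0):
    """Two-arm synthetic generator whose (hidden) margin at every
    context is exactly `delta`; the learner sees only noisy samples."""
    rng = random.Random(seed)
    def mu(arm, x):
        base = math.sin(3 * x)                 # shared smooth structure
        return base + (delta if arm == 0 else 0.0)
    def sample(arm, x):
        return mu(arm, x) + rng.gauss(0.0, sigma)
    return mu, sample

mu, sample = make_generator(delta=0.1)
x = 0.25
hidden_margin = mu(0, x) - mu(1, x)   # equals delta; unseen by the learner
```

Sweeping `delta` toward zero then traces out the 1/Δ² growth in sample cost discussed earlier.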
Incorporating Network Interference (MABNI):
We model interference by augmenting the reward function: Y_t(a) = μ_a(x_t) + η · ∑_{b ≠ a} I_{t,b} · φ_interf(a, b) + ε_t, where I_{t,b} indicates whether arm b is pulled at time t, η controls interference strength, and φ_interf(a, b) is the interference kernel. Interference-aware estimators inflate variance estimates or explicitly account for these cross-arm effects.
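This interference model can be made concrete as below; the particular μ_a, φ_interf, and parameter values are illustrative stand-ins:

```python
import random

def interfered_reward(a, x, pulls, mu, phi, eta=0.2, sigma=0.1, rng=None):
    """Reward of arm a under cross-arm interference:
    Y(a) = mu_a(x) + eta * sum_{b != a} I_b * phi(a, b) + noise,
    where pulls[b] (the indicator I_b) marks arms pulled this round."""
    rng = rng or random.Random(0)
    spill = eta * sum(pulls[b] * phi(a, b) for b in range(len(pulls)) if b != a)
    return mu(a, x) + spill + rng.gauss(0.0, sigma)

# Illustrative choices (not from the article):
mu = lambda a, x: [0.5, 0.8, 0.3][a] * x
phi = lambda a, b: 1.0 / (1 + abs(a - b))   # nearby arms interfere more
pulls = [1, 1, 0]                           # arms 0 and 1 pulled this round
y = interfered_reward(1, x=1.0, pulls=pulls, mu=mu, phi=phi)
```

Setting `eta=0` recovers the interference-free model, which makes ablations on interference strength straightforward.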
Evaluation Metrics:
- Cumulative regret RT: Measures long-run performance efficiency.
- Identification accuracy of the best arm: Assesses learning reliability.
- Explicit sample-cost measurements: Tracks total cost and cost per observation.
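For instance, cumulative regret can be computed directly from per-round mean rewards; the list-based input format here is our own convention for the sketch:

```python
def cumulative_regret(pulled_means, best_means):
    """Cumulative regret R_T = sum_t (best mean at round t
    minus mean of the arm actually pulled at round t)."""
    return sum(b - p for b, p in zip(best_means, pulled_means))

# Over 3 rounds: best arm means [1.0, 1.0, 0.9]; pulled arms earned [0.5, 1.0, 0.7]
r = cumulative_regret([0.5, 1.0, 0.7], [1.0, 1.0, 0.9])  # ~ 0.7
```

Identification accuracy and per-observation sample cost can be logged alongside this in the same evaluation loop.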
Ablation Studies:
We conduct ablation studies on:
- Batch size: Varying update frequency.
- Margin uncertainty: Altering confidence bound widths.
- Interference: Adjusting strength and structure.
Reproducibility is ensured by documenting seeds, data distributions, sampling rules, and versioning code and environments.
Conclusion
By explicitly modeling batch structure, context, unknown margins, and interference, we can develop robust algorithms like MR-BNPUCB that navigate the adaptivity barrier in batched nonparametric bandits. This focused approach ensures reliable convergence and controlled sample costs, providing a practical blueprint for tackling complex online decision-making problems.
