The Sample Complexity of Multiclass and Sparse Contextual Bandits

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of identifying an $\varepsilon$-optimal policy in stochastic contextual bandits with $s$-sparse rewards. The authors propose algorithms whose leading sample-complexity term depends on the sparsity level $s$, with the action space size $|A|$ entering only through a lower-order $|A|/\varepsilon$ term rather than through high-degree polynomials. Two complementary approaches are developed: an information-theoretic analysis based on the Decision-Estimation Coefficient (DEC), and a more concrete method based on a low-variance exploration technique. The bound extends to policy classes with bounded Natarajan dimension, and the second approach extends to contextual combinatorial semi-bandits. The resulting sample complexity is $\widetilde{O}\big((s/\varepsilon^2 + |A|/\varepsilon) \log(|\Pi|/\delta)\big)$, which matches a lower bound up to logarithmic factors. This removes the additional $|A|^9$ dependence incurred by prior results.
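To make the two regimes of the bound concrete, the following is a minimal sketch that evaluates the expression $(s/\varepsilon^2 + |A|/\varepsilon)\log(|\Pi|/\delta)$ with all constants and the logarithmic factors hidden by $\widetilde{O}$ dropped. The function and variable names (`sample_bound`, `crossover_eps`, `sparse_term`, `action_term`) are hypothetical and chosen only for illustration; the output is an order-of-magnitude guide, not a sample count guaranteed by the paper.

```python
import math


def sample_bound(s, num_actions, eps, num_policies, delta):
    """Evaluate (s/eps^2 + |A|/eps) * log(|Pi|/delta).

    Constants and the log factors hidden by the tilde-O notation are
    dropped, so this is only a rough scaling guide (an assumption of
    this sketch, not a statement from the paper).
    """
    if not (0 < eps <= 1):
        raise ValueError("eps must lie in (0, 1]")
    if not (0 < delta < 1):
        raise ValueError("delta must lie in (0, 1)")
    sparse_term = s / eps**2          # term driven by the sparsity level s
    action_term = num_actions / eps   # lower-order term driven by |A|
    return (sparse_term + action_term) * math.log(num_policies / delta)


def crossover_eps(s, num_actions):
    """Accuracy at which the two terms are equal: s/eps^2 = |A|/eps."""
    return s / num_actions


if __name__ == "__main__":
    # Hypothetical instance: multiclass zero-one rewards (s = 1), 1000 labels.
    s, num_actions = 1, 1000
    for eps in (0.1, 0.01, 0.001, 0.0001):
        n = sample_bound(s, num_actions, eps, num_policies=1e6, delta=0.01)
        print(f"eps={eps:g}: about {n:.3g} samples (up to constants)")
    print("terms balance at eps =", crossover_eps(s, num_actions))
```

Solving $s/\varepsilon^2 = |A|/\varepsilon$ gives $\varepsilon = s/|A|$: for accuracies finer than this, the $s/\varepsilon^2$ term dominates and the bound no longer grows with $|A|$ beyond the logarithmic factor.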

📝 Abstract

We study contextual bandits in the stochastic i.i.d. setting, where a learner observes contexts drawn from an unknown distribution, selects actions from a finite set $A$, and aims to identify an approximately optimal policy from a given class based on bandit feedback. Motivated by bandit multiclass classification with zero-one rewards, we focus on the $s$-sparse setting in which, for every context, the reward vector has $L_1$-norm at most $s \ll |A|$. Our main result is the design of algorithms that, with high probability, output an $\varepsilon$-optimal policy compared to policy class $\Pi$ using $\widetilde{O}\big((s/\varepsilon^2 + |A|/\varepsilon)\log(|\Pi|/\delta)\big)$ samples. We extend this bound to general Natarajan classes and complement it with a matching lower bound (up to logarithmic factors), thereby closing a substantial gap left by prior work (Erez et al., 2024, 2025), which incurred an additional $\Theta(|A|^9)$ dependence. We obtain these results via two complementary approaches. First, we analyze contextual bandits through the lens of contextual decision making with structured observations, designing an exploration-by-optimization algorithm whose sample complexity is governed by the decision-estimation coefficient (DEC; Foster et al., 2021, 2022). We show that, with $s$-sparse rewards, the induced model class admits a sharp DEC bound that scales with $s$ and directly yields the optimal rate. Since this approach is largely information-theoretic and involves solving complex min-max optimization problems, we also develop a second, more specialized algorithmic method based on a low-variance exploration technique. This approach leads to concrete, tractable algorithms and naturally extends to contextual combinatorial semi-bandits, leading to improved sample complexity guarantees for bandit multiclass list classification.
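As a notational sketch of the two definitions the abstract relies on, the $s$-sparsity condition and $\varepsilon$-optimality can be written as follows. The symbols $x$, $r$, $\mathcal{D}$, $V$, and $\hat{\pi}$ are assumptions introduced here for illustration and may differ from the paper's own notation; in particular, $V$ is read here as the expected reward of a policy under the context distribution.

```latex
% s-sparsity: for every context x, the reward vector over actions has small L1 norm
\| r(x,\cdot) \|_1 \;=\; \sum_{a \in A} |r(x,a)| \;\le\; s \;\ll\; |A|

% Value of a policy (assumed notation) and epsilon-optimality relative to the class \Pi
V(\pi) \;=\; \mathbb{E}_{x \sim \mathcal{D}}\big[ r(x, \pi(x)) \big],
\qquad
V(\hat{\pi}) \;\ge\; \max_{\pi \in \Pi} V(\pi) - \varepsilon
```

In bandit multiclass classification with zero-one rewards, only the correct label earns reward 1 for a given context, so the reward vector has $L_1$-norm 1 and the setting is $s$-sparse with $s = 1$.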

Problem

Research questions and friction points this paper is trying to address.

contextual bandits

sample complexity

sparse rewards

multiclass classification

stochastic i.i.d. setting

Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse contextual bandits

sample complexity

decision-estimation coefficient