🤖 AI Summary
This paper investigates the learnability of reward function classes ℱ in structured multi-armed bandits, seeking a PAC-style theory of bandit learnability. Addressing the two core questions of which classes ℱ are learnable and how to learn them, the authors prove that no combinatorial dimension (in the sense of VC-dimension analogues) can characterize bandit learnability, even for finite classes. They further demonstrate a decoupling of statistical feasibility from computational tractability: they construct a class in which at most two queries suffice to identify the optimal arm, yet no polynomial-time algorithm can do so unless RP = NP, even though standard operations such as empirical risk minimization (ERM) remain efficiently computable for the same class. The work also examines learning under noise, trade-offs between noise models, and the relationship between query complexity and regret minimization. Collectively, these results show that the dimension-driven paradigm of classical learning theory does not transfer to bandit learning and lay groundwork for a general theory of it.
📝 Abstract
We study the task of bandit learning, also known as best-arm identification, under the assumption that the true reward function f belongs to a known, but arbitrary, function class F. We seek a general theory of bandit learnability, akin to the PAC framework for classification. Our investigation is guided by the following two questions: (1) which classes F are learnable, and (2) how they are learnable. For example, in the case of binary PAC classification, learnability is fully determined by a combinatorial dimension, the VC dimension, and can be attained via a simple algorithmic principle, namely empirical risk minimization (ERM). In contrast to classical learning-theoretic results, our findings reveal limitations of learning in structured bandits, offering insights into the boundaries of bandit learnability. First, for the question of "which", we show that the paradigm of identifying the learnable classes via a dimension-like quantity fails for bandit learning. We give a simple proof demonstrating that no combinatorial dimension can characterize bandit learnability, even for finite classes, following a standard definition of dimension introduced by Ben-David et al. (2019). For the question of "how", we prove a computational hardness result: we construct a reward function class for which at most two queries are needed to find the optimal action, yet no algorithm can do so in polynomial time unless RP = NP. We also prove that this class admits efficient algorithms for standard operations often considered in learning theory, such as ERM. This implies that the computational hardness is, in this case, inherent to the task of bandit learning. Beyond these results, we investigate additional themes such as learning under noise, trade-offs between noise models, and the relationship between query complexity and regret minimization.
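To make the setting concrete, here is a minimal sketch (not the paper's construction; all names are hypothetical) of noiseless best-arm identification for a finite, known class F, via version-space elimination: query arms and discard functions inconsistent with the observed rewards until every surviving function agrees on the optimal arm.

```python
# Illustrative sketch only, under simplifying assumptions: finite action set,
# finite class F of reward functions (dicts mapping arm -> reward), and
# noiseless queries returning the true reward f*(a).

def identify_best_arm(actions, F, query):
    """Query arms until all functions consistent with the observations
    agree on which arm is optimal, then return that arm."""
    version_space = list(F)  # functions still consistent with the data
    for a in actions:
        # Stop early once every surviving function shares the same argmax.
        argmaxes = {max(actions, key=lambda x: f[x]) for f in version_space}
        if len(argmaxes) == 1:
            return argmaxes.pop()
        reward = query(a)  # observe the true reward f*(a)
        version_space = [f for f in version_space if f[a] == reward]
    # All queries made: any surviving function pins down the best arm.
    f = version_space[0]
    return max(actions, key=lambda x: f[x])


f_star = {0: 0.1, 1: 0.9, 2: 0.3}  # true reward function
F = [f_star, {0: 0.9, 1: 0.1, 2: 0.3}, {0: 0.1, 1: 0.3, 2: 0.9}]
best = identify_best_arm([0, 1, 2], F, lambda a: f_star[a])  # returns 1
```

The paper's hardness result targets exactly the gap this sketch glosses over: such elimination can be statistically cheap (few queries) while still being computationally intractable when F is large and only implicitly represented.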