🤖 AI Summary
This paper studies combinatorial semi-bandits, where an agent selects a subset of base arms per round and observes feedback from each selected arm. While practically important, existing algorithms rely on one expensive combinatorial optimization oracle call per round, severely limiting scalability. To address this, we propose a novel online learning framework that reduces the per-round oracle calls to only $O(\log \log T)$ while achieving the optimal $O(\sqrt{T})$ regret bound. Our key contributions are: (1) a covariance-adaptive UCB strategy that explicitly models the reward noise structure; (2) a unified treatment accommodating both linear and nonlinear reward functions; and (3) tight theoretical guarantees under both worst-case and general smooth reward settings. Experiments demonstrate significant improvements in both computational efficiency and empirical performance.
📝 Abstract
We study the combinatorial semi-bandit problem where an agent selects a subset of base arms and receives individual feedback. While this generalizes the classical multi-armed bandit and has broad applicability, its scalability is limited by the high cost of combinatorial optimization, requiring oracle queries at every round. To tackle this, we propose oracle-efficient frameworks that significantly reduce oracle calls while maintaining tight regret guarantees. For the worst-case linear reward setting, our algorithms achieve $\tilde{O}(\sqrt{T})$ regret using only $O(\log \log T)$ oracle queries. We also propose covariance-adaptive algorithms that leverage noise structure for improved regret, and extend our approach to general (non-linear) rewards. Overall, our methods reduce oracle usage from linear to (doubly) logarithmic in time, with strong theoretical guarantees.
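To make the oracle-efficiency idea concrete, the sketch below is a minimal, hypothetical illustration (not the paper's algorithm): a UCB-style semi-bandit loop that re-solves the combinatorial problem only at doubling times $1, 2, 4, \dots$, so the oracle is queried $O(\log T)$ times instead of once per round. The `topk_oracle` stand-in, the noise scale, and the doubling schedule are all assumptions for illustration; the paper's methods achieve the stronger $O(\log \log T)$ query complexity.

```python
import numpy as np

def topk_oracle(scores, k):
    # Stand-in combinatorial oracle: pick the k arms with the largest scores.
    # In general this step can be an expensive combinatorial solver.
    return np.argsort(scores)[-k:]

def lazy_semi_bandit(mu, k, T, seed=0):
    """Illustrative semi-bandit loop that re-queries the oracle only on a
    doubling schedule, rather than every round (hypothetical sketch)."""
    rng = np.random.default_rng(seed)
    n = len(mu)
    counts = np.ones(n)            # one pseudo-pull each, to initialize
    means = rng.normal(mu, 0.1)    # initial noisy estimates of arm means
    oracle_calls = 0
    next_update = 1                # next round at which to re-solve: 1, 2, 4, ...
    action = np.arange(k)          # placeholder; replaced at t = 1
    total_reward = 0.0
    for t in range(1, T + 1):
        if t >= next_update:
            # Re-solve only at doubling times: O(log T) oracle calls overall.
            ucb = means + np.sqrt(2 * np.log(T) / counts)
            action = topk_oracle(ucb, k)
            oracle_calls += 1
            next_update *= 2
        # Semi-bandit feedback: observe a noisy reward for each selected arm.
        rewards = rng.normal(mu[action], 0.1)
        total_reward += rewards.sum()
        # Incremental update of the per-arm empirical means.
        means[action] = (means[action] * counts[action] + rewards) / (counts[action] + 1)
        counts[action] += 1
    return total_reward, oracle_calls
```

For horizon `T = 1024` this loop invokes the oracle at rounds 1, 2, 4, ..., 1024, i.e. 11 times, versus 1024 calls for a per-round solver; the lazy schedule is the source of the computational savings the abstract describes.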