Oracle-Efficient Combinatorial Semi-Bandits

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies combinatorial semi-bandits, where an agent selects a subset of base arms in each round and observes feedback from every selected arm. Existing algorithms call an expensive combinatorial optimization oracle in every round, which severely limits scalability. To address this, the paper proposes oracle-efficient online learning frameworks that reduce the total number of oracle calls over $T$ rounds from linear in $T$ to only $O(\log\log T)$ while achieving $\tilde{O}(\sqrt{T})$ regret in the worst-case linear reward setting. The key contributions are: (1) a covariance-adaptive UCB strategy that explicitly models the reward noise structure; (2) a unified treatment accommodating both linear and nonlinear reward functions; and (3) tight theoretical guarantees under both worst-case and general smooth reward settings. Experiments demonstrate significant improvements in both computational efficiency and empirical performance.

📝 Abstract
We study the combinatorial semi-bandit problem where an agent selects a subset of base arms and receives individual feedback. While this generalizes the classical multi-armed bandit and has broad applicability, its scalability is limited by the high cost of combinatorial optimization, requiring oracle queries at every round. To tackle this, we propose oracle-efficient frameworks that significantly reduce oracle calls while maintaining tight regret guarantees. For the worst-case linear reward setting, our algorithms achieve $\tilde{O}(\sqrt{T})$ regret using only $O(\log\log T)$ oracle queries. We also propose covariance-adaptive algorithms that leverage noise structure for improved regret, and extend our approach to general (non-linear) rewards. Overall, our methods reduce oracle usage from linear to (doubly) logarithmic in time, with strong theoretical guarantees.
Problem

Research questions and friction points this paper is trying to address.

Reducing oracle calls in combinatorial semi-bandits
Achieving sublinear regret with (doubly) logarithmically many oracle queries
Extending framework to nonlinear rewards with guarantees
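
To make the friction point concrete, here is a minimal sketch of the interaction protocol under simplifying assumptions: Bernoulli base arms, a linear reward (sum of the selected arms), and a toy top-m selection standing in for the combinatorial oracle. The names top_m_oracle, run_semi_bandit, and call_times are hypothetical, and the index used is a plain UCB rather than anything from the paper. With call_times=None the oracle is queried in every round, as in the classical protocol; passing a list of rounds replays the last subset in between, which only illustrates the idea of infrequent oracle calls and carries none of the paper's regret guarantees.

```python
import math
import random


def top_m_oracle(scores, m):
    """Toy stand-in for the combinatorial oracle: indices of the m largest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:m]


def run_semi_bandit(means, m, T, call_times=None, seed=0):
    """Simulate T rounds; return (number of oracle calls, total reward).

    call_times=None queries the oracle in every round (the classical
    protocol); otherwise the oracle is queried only at the listed rounds
    and the last returned subset is replayed in between.
    """
    rng = random.Random(seed)
    d = len(means)
    counts = [0] * d
    sums = [0.0] * d
    schedule = set(range(1, T + 1)) if call_times is None else set(call_times)
    action = list(range(m))  # arbitrary subset used until the first oracle call
    oracle_calls = 0
    total_reward = 0.0
    for t in range(1, T + 1):
        if t in schedule:
            # Optimistic index per base arm; unplayed arms get +inf.
            ucb = [
                sums[i] / counts[i] + math.sqrt(2.0 * math.log(T) / counts[i])
                if counts[i] > 0 else math.inf
                for i in range(d)
            ]
            action = top_m_oracle(ucb, m)
            oracle_calls += 1
        for i in action:
            # Semi-bandit feedback: one Bernoulli observation per selected arm.
            x = 1.0 if rng.random() < means[i] else 0.0
            counts[i] += 1
            sums[i] += x
            total_reward += x
    return oracle_calls, total_reward
```

For example, run_semi_bandit([0.9, 0.8, 0.2, 0.1], m=2, T=1000) makes 1000 oracle calls, while the same run with call_times=[1, 32, 178] makes 3.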
Innovation

Methods, ideas, or system contributions that make the work stand out.

Oracle-efficient frameworks reduce combinatorial optimization costs
Algorithms achieve (doubly) logarithmic oracle queries with tight regret
Covariance-adaptive methods leverage noise for improved performance
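
One way to see how a doubly logarithmic number of oracle calls can cover a horizon of T rounds is the geometric grid t_k = T^(1 - 2^(-k)), a standard device from the batched-bandit literature. The sketch below assumes that grid purely for illustration; the summary does not state which schedule the paper uses, and oracle_call_times is a hypothetical helper name.

```python
import math


def oracle_call_times(T: int) -> list[int]:
    """Rounds at which a batched learner would re-query the oracle.

    Assumed grid: t_k = ceil(T ** (1 - 2 ** -k)) for k = 1..M with
    M = ceil(log2(log2(T))). This is an illustrative choice, not
    necessarily the schedule used in the paper.
    """
    if T < 2:
        raise ValueError("T must be at least 2")
    M = max(1, math.ceil(math.log2(math.log2(T))))
    times = [1]  # the first query happens in round 1
    for k in range(1, M + 1):
        t = math.ceil(T ** (1.0 - 2.0 ** -k))
        # Keep the grid strictly increasing and inside the horizon.
        if times[-1] < t <= T:
            times.append(t)
    return times
```

With this grid the last point satisfies t_M >= T/2, so the final batch spans at most half the horizon, and the number of queries is at most M + 1 = O(log log T). For T = 10^6 that is 6 queries instead of one per round.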
Jung-hun Kim
CREST, ENSAE, IP Paris FairPlay joint team, France
Milan Vojnović
London School of Economics, United Kingdom
Min-hwan Oh
Seoul National University
Reinforcement Learning, Bandit Algorithms, Machine Learning