🤖 AI Summary
This paper studies combinatorial semi-bandits, where an agent selects a subset of base arms per round and observes feedback from each selected arm. While practically important, existing algorithms rely on one expensive combinatorial optimization oracle call per round, severely limiting scalability. To address this, we propose a novel online learning framework that reduces the per-round oracle calls to only $O(\log \log T)$ while achieving the optimal $O(\sqrt{T})$ regret bound. Our key contributions are: (1) a covariance-adaptive UCB strategy that explicitly models the reward noise structure; (2) a unified treatment accommodating both linear and nonlinear reward functions; and (3) tight theoretical guarantees under both worst-case and general smooth reward settings. Experiments demonstrate significant improvements in both computational efficiency and empirical performance.
📝 Abstract
We study the combinatorial semi-bandit problem where an agent selects a subset of base arms and receives individual feedback. While this generalizes the classical multi-armed bandit and has broad applicability, its scalability is limited by the high cost of combinatorial optimization, requiring oracle queries at every round. To tackle this, we propose oracle-efficient frameworks that significantly reduce oracle calls while maintaining tight regret guarantees. For the worst-case linear reward setting, our algorithms achieve $\tilde{O}(\sqrt{T})$ regret using only $O(\log \log T)$ oracle queries. We also propose covariance-adaptive algorithms that leverage noise structure for improved regret, and extend our approach to general (non-linear) rewards. Overall, our methods reduce oracle usage from linear to (doubly) logarithmic in time, with strong theoretical guarantees.
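To make the oracle-efficiency idea concrete, the sketch below is a minimal, hypothetical illustration (not the paper's algorithm): a UCB-style semi-bandit loop that re-solves the combinatorial problem only at doubling times $1, 2, 4, \dots$, so the oracle is queried $O(\log T)$ times instead of once per round. The `topk_oracle` stand-in, the noise scale, and the doubling schedule are all assumptions for illustration; the paper's methods achieve the stronger $O(\log \log T)$ query complexity.

```python
import numpy as np

def topk_oracle(scores, k):
    # Stand-in combinatorial oracle: pick the k arms with the largest scores.
    # In general this step can be an expensive combinatorial solver.
    return np.argsort(scores)[-k:]

def lazy_semi_bandit(mu, k, T, seed=0):
    """Illustrative semi-bandit loop that re-queries the oracle only on a
    doubling schedule, rather than every round (hypothetical sketch)."""
    rng = np.random.default_rng(seed)
    n = len(mu)
    counts = np.ones(n)            # one pseudo-pull each, to initialize
    means = rng.normal(mu, 0.1)    # initial noisy estimates of arm means
    oracle_calls = 0
    next_update = 1                # next round at which to re-solve: 1, 2, 4, ...
    action = np.arange(k)          # placeholder; replaced at t = 1
    total_reward = 0.0
    for t in range(1, T + 1):
        if t >= next_update:
            # Re-solve only at doubling times: O(log T) oracle calls overall.
            ucb = means + np.sqrt(2 * np.log(T) / counts)
            action = topk_oracle(ucb, k)
            oracle_calls += 1
            next_update *= 2
        # Semi-bandit feedback: observe a noisy reward for each selected arm.
        rewards = rng.normal(mu[action], 0.1)
        total_reward += rewards.sum()
        # Incremental update of the per-arm empirical means.
        means[action] = (means[action] * counts[action] + rewards) / (counts[action] + 1)
        counts[action] += 1
    return total_reward, oracle_calls
```

For horizon `T = 1024` this loop invokes the oracle at rounds 1, 2, 4, ..., 1024, i.e. 11 times, versus 1024 calls for a per-round solver; the lazy schedule is the source of the computational savings the abstract describes.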