Multi-Play Combinatorial Semi-Bandit Problem

📅 2025-09-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing combinatorial semi-bandits (CSBs) are restricted to binary actions, which excludes fundamental combinatorial optimization problems such as optimal transport and knapsack that require nonnegative integer-valued action vectors. Method: The paper proposes the multi-play combinatorial semi-bandit (MP-CSB), generalizing the action space to nonnegative integer vectors, and designs a Thompson-sampling-based algorithm that remains computationally feasible even when the action space is exponentially large in the number of arms. A second, best-of-both-worlds algorithm is robust to both stochastic and adversarial environments, with adversarial guarantees that adapt to the cumulative loss of the optimal action, the total quadratic variation, and the path-length of the loss sequence. Contribution/Results: The paper establishes an $O(\log T)$ distribution-dependent regret in the stochastic regime and a worst-case $\tilde{\mathcal{O}}(\sqrt{T})$ regret in the adversarial regime, and numerical experiments show improvements over existing CSB methods on combinatorial optimization benchmarks.

Technology Category

Application Category

📝 Abstract
In the combinatorial semi-bandit (CSB) problem, a player selects an action from a combinatorial action set and observes feedback from the base arms included in the action. While CSB is widely applicable to combinatorial optimization problems, its restriction to binary decision spaces excludes important cases involving non-negative integer flows or allocations, such as the optimal transport and knapsack problems. To overcome this limitation, we propose the multi-play combinatorial semi-bandit (MP-CSB), where a player can select a non-negative integer action and observe multiple feedbacks from a single arm in each round. We propose two algorithms for the MP-CSB. One is a Thompson-sampling-based algorithm that is computationally feasible even when the action space is exponentially large with respect to the number of arms, and attains $O(\log T)$ distribution-dependent regret in the stochastic regime, where $T$ is the time horizon. The other is a best-of-both-worlds algorithm, which achieves $O(\log T)$ variance-dependent regret in the stochastic regime and the worst-case $\tilde{\mathcal{O}}\left(\sqrt{T}\right)$ regret in the adversarial regime. Moreover, its regret in the adversarial regime is data-dependent, adapting to the cumulative loss of the optimal action, the total quadratic variation, and the path-length of the loss sequence. Finally, we numerically show that the proposed algorithms outperform existing methods in the CSB literature.
Problem

Research questions and friction points this paper is trying to address.

Extends combinatorial bandits to non-negative integer action spaces
Addresses limitations of binary action spaces in optimal transport and knapsack problems
Develops algorithms for stochastic and adversarial regret regimes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends CSB to non-negative integer actions
Uses Thompson-sampling for exponential action spaces
Best-of-both-worlds algorithm for stochastic/adversarial regimes
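To make the Thompson-sampling idea concrete, here is a minimal illustrative sketch, not the paper's algorithm: arms have unknown means, an action is a nonnegative integer allocation under a hypothetical knapsack constraint (weights and capacity are assumptions), each round samples per-arm means from Gaussian posteriors, a greedy integer-knapsack oracle maximizes the sampled linear reward, and each allocated unit returns an independent noisy observation (semi-bandit feedback).

```python
import random

def thompson_mp_csb(means, weights, capacity, horizon, seed=0):
    """Illustrative sketch only (assumed setup, not the paper's method):
    Thompson sampling for a multi-play semi-bandit whose actions are
    nonnegative integer allocations x with sum_i weights[i]*x[i] <= capacity.
    Gaussian posteriors; each chosen unit of an arm yields one noisy reward."""
    rng = random.Random(seed)
    n = len(means)
    counts = [0] * n        # units of each arm played so far
    sums = [0.0] * n        # cumulative observed reward per arm
    total = 0.0
    for _ in range(horizon):
        # 1) Sample a mean estimate per arm from a Gaussian posterior
        #    (std shrinks as 1/sqrt(counts); a standard prior when unplayed).
        theta = [
            rng.gauss(sums[i] / counts[i], 1.0 / (counts[i] ** 0.5))
            if counts[i] > 0 else rng.gauss(0.0, 1.0)
            for i in range(n)
        ]
        # 2) Combinatorial oracle on the sampled means: greedy integer
        #    knapsack by sampled-mean-per-unit-weight, skipping arms whose
        #    sampled mean is nonpositive.
        order = sorted(range(n), key=lambda i: theta[i] / weights[i],
                       reverse=True)
        action = [0] * n
        budget = capacity
        for i in order:
            if theta[i] <= 0 or budget <= 0:
                continue
            take = budget // weights[i]
            action[i] = take
            budget -= take * weights[i]
        # 3) Semi-bandit feedback: one noisy observation per unit played.
        for i in range(n):
            for _ in range(action[i]):
                r = means[i] + rng.gauss(0.0, 0.1)
                counts[i] += 1
                sums[i] += r
                total += r
    return total, counts
```

The oracle here is the trivial greedy for a linear objective; the paper's point is that the posterior-sampling step keeps the per-round computation tractable even though the integer action space is exponentially large.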
Shintaro Nakamura
University of Tokyo, Tokyo, Japan
Yuko Kuroki
CENTAI Institute
Optimization · Machine Learning · Bandit Algorithm · Graph Mining · Computer Science
Wei Chen
Microsoft Research, Beijing, China