🤖 AI Summary
Existing combinatorial semi-bandits (CSBs) are restricted to binary actions, limiting their applicability to fundamental combinatorial optimization problems such as optimal transport and knapsack, which require nonnegative integer-valued action vectors.
Method: We propose the Multi-Choice Combinatorial Semi-Bandit (MP-CSB) framework—the first to generalize action spaces to nonnegative integer vectors—and design an efficient Thompson sampling–based algorithm. To ensure robustness against both stochastic and adversarial environments, we introduce a “dual-robust” algorithm integrating variance-adaptive analysis, path-length control, and quadratic variation techniques to handle exponential action spaces and heterogeneous feedback.
Contribution/Results: We establish tight theoretical guarantees: an $O(log T)$ distribution-dependent regret under stochastic rewards and a $ ilde{mathcal{O}}(sqrt{T})$ worst-case regret under adversarial rewards. Empirical evaluation demonstrates significant improvements over state-of-the-art CSB methods across diverse combinatorial optimization benchmarks.
📝 Abstract
In the combinatorial semi-bandit (CSB) problem, a player selects an action from a combinatorial action set and observes feedback from the base arms included in the action. While CSB is widely applicable to combinatorial optimization problems, its restriction to binary decision spaces excludes important cases involving non-negative integer flows or allocations, such as the optimal transport and knapsack problems.To overcome this limitation, we propose the multi-play combinatorial semi-bandit (MP-CSB), where a player can select a non-negative integer action and observe multiple feedbacks from a single arm in each round. We propose two algorithms for the MP-CSB. One is a Thompson-sampling-based algorithm that is computationally feasible even when the action space is exponentially large with respect to the number of arms, and attains $O(log T)$ distribution-dependent regret in the stochastic regime, where $T$ is the time horizon. The other is a best-of-both-worlds algorithm, which achieves $O(log T)$ variance-dependent regret in the stochastic regime and the worst-case $ ilde{mathcal{O}}left( sqrt{T}
ight)$ regret in the adversarial regime. Moreover, its regret in adversarial one is data-dependent, adapting to the cumulative loss of the optimal action, the total quadratic variation, and the path-length of the loss sequence. Finally, we numerically show that the proposed algorithms outperform existing methods in the CSB literature.