Multi-Play Combinatorial Semi-Bandit Problem

📅 2025-09-11

📈 Citations: 0

✨ Influential: 0

career value

257K/year

🤖 AI Summary

Existing combinatorial semi-bandits (CSBs) are restricted to binary actions, limiting their applicability to fundamental combinatorial optimization problems such as optimal transport and knapsack, which require nonnegative integer-valued action vectors. Method: We propose the Multi-Choice Combinatorial Semi-Bandit (MP-CSB) framework—the first to generalize action spaces to nonnegative integer vectors—and design an efficient Thompson sampling–based algorithm. To ensure robustness against both stochastic and adversarial environments, we introduce a “dual-robust” algorithm integrating variance-adaptive analysis, path-length control, and quadratic variation techniques to handle exponential action spaces and heterogeneous feedback. Contribution/Results: We establish tight theoretical guarantees: an $O(log T)$ distribution-dependent regret under stochastic rewards and a $ ilde{mathcal{O}}(sqrt{T})$ worst-case regret under adversarial rewards. Empirical evaluation demonstrates significant improvements over state-of-the-art CSB methods across diverse combinatorial optimization benchmarks.

Technology Category

Application Category

📝 Abstract

In the combinatorial semi-bandit (CSB) problem, a player selects an action from a combinatorial action set and observes feedback from the base arms included in the action. While CSB is widely applicable to combinatorial optimization problems, its restriction to binary decision spaces excludes important cases involving non-negative integer flows or allocations, such as the optimal transport and knapsack problems.To overcome this limitation, we propose the multi-play combinatorial semi-bandit (MP-CSB), where a player can select a non-negative integer action and observe multiple feedbacks from a single arm in each round. We propose two algorithms for the MP-CSB. One is a Thompson-sampling-based algorithm that is computationally feasible even when the action space is exponentially large with respect to the number of arms, and attains $O(log T)$ distribution-dependent regret in the stochastic regime, where $T$ is the time horizon. The other is a best-of-both-worlds algorithm, which achieves $O(log T)$ variance-dependent regret in the stochastic regime and the worst-case $ ilde{mathcal{O}}left( sqrt{T} ight)$ regret in the adversarial regime. Moreover, its regret in adversarial one is data-dependent, adapting to the cumulative loss of the optimal action, the total quadratic variation, and the path-length of the loss sequence. Finally, we numerically show that the proposed algorithms outperform existing methods in the CSB literature.

Problem

Research questions and friction points this paper is trying to address.

Extends combinatorial bandits to non-negative integer action spaces

Solves limitations in optimal transport and knapsack problems

Develops algorithms for stochastic and adversarial regret regimes

Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends CSB to non-negative integer actions

Uses Thompson-sampling for exponential action spaces

Best-of-both-worlds algorithm for stochastic/adversarial regimes

🔎 Similar Papers

Fast and Sample Efficient Multi-Task Representation Learning in Stochastic Contextual Bandits

2024-10-02International Conference on Machine LearningCitations: 1

Amazon

Arlington, VA, USA / Bellevue, WA, USA / Boston, MA, USA

Master Thesis Reinforcement Learning for Behavior Planning in Automated Driving

Bosch Group

Renningen, BW, DE

Authors to Follow