🤖 AI Summary
This paper studies the $m$-set semi-bandit problem, a common case of the combinatorial semi-bandit framework, with the aim of unifying learning in adversarial and stochastic environments. We propose a Follow-the-Perturbed-Leader (FTPL) algorithm based on Fréchet-distributed perturbations. To our knowledge, this is the first method achieving best-of-both-worlds guarantees for this problem: it attains the optimal $O(\sqrt{nmd})$ regret bound in the adversarial setting and $O(\log n)$ regret in the stochastic setting. Unlike Follow-the-Regularized-Leader (FTRL), which requires solving a computationally expensive convex optimization problem at each step, our FTPL algorithm relies only on lightweight perturbations and linear optimization over the action set, which significantly improves computational efficiency. The theoretical analysis establishes the optimality and robustness of FTPL in combinatorial semi-bandits, thereby extending the applicability of perturbation-based methods to structured action spaces.
📝 Abstract
We consider a common case of the combinatorial semi-bandit problem, the $m$-set semi-bandit, where the learner selects exactly $m$ arms out of $d$ arms in total. In the adversarial setting, the best regret bound, known to be $\mathcal{O}(\sqrt{nmd})$ for time horizon $n$, is achieved by the well-known Follow-the-Regularized-Leader (FTRL) policy. FTRL, however, requires explicitly computing the arm-selection probabilities by solving an optimization problem at each time step and then sampling according to them. This computation can be avoided by the Follow-the-Perturbed-Leader (FTPL) policy, which simply pulls the $m$ arms with the smallest randomly perturbed (estimated) losses. In this paper, we show that FTPL with a Fréchet perturbation also enjoys the optimal regret bound $\mathcal{O}(\sqrt{nmd})$ in the adversarial setting and achieves best-of-both-worlds regret bounds, i.e., it additionally achieves logarithmic regret in the stochastic setting.
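The FTPL selection step described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names `sample_frechet` and `ftpl_select`, the shape parameter of 2, the learning rate `eta`, and the choice to subtract the perturbation scaled by `1 / eta` from the cumulative loss estimate are all assumptions made for illustration. How the loss estimates themselves are built is not stated in the abstract and is left out.

```python
import math
import random


def sample_frechet(shape, rng):
    """Draw one Frechet(shape) sample by inverting F(x) = exp(-x**(-shape))."""
    u = rng.random()
    # rng.random() lies in [0, 1); redraw on 0.0, where log is undefined.
    while u == 0.0:
        u = rng.random()
    return (-math.log(u)) ** (-1.0 / shape)


def ftpl_select(cum_loss_est, m, eta, shape=2.0, rng=None):
    """Return the m arm indices with the smallest perturbed estimated loss.

    cum_loss_est: estimated cumulative loss per arm (hypothetical input).
    eta: learning rate scaling the perturbation (assumed form: r / eta).
    """
    rng = rng or random.Random()
    d = len(cum_loss_est)
    # Perturb each arm's estimated loss independently (assumed sign and scale).
    perturbed = [
        cum_loss_est[i] - sample_frechet(shape, rng) / eta for i in range(d)
    ]
    # Linear optimization over m-sets reduces to taking the m smallest values.
    return sorted(range(d), key=lambda i: perturbed[i])[:m]


if __name__ == "__main__":
    arms = ftpl_select([3.0, 1.0, 4.0, 1.5, 9.0], m=2, eta=1.0,
                       rng=random.Random(0))
    print(arms)
```

The point of the sketch is the cost profile: each round needs only `d` random draws and a sort, with no arm-selection probabilities computed explicitly.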