Follow-the-Perturbed-Leader Achieves Best-of-Both-Worlds for the m-Set Semi-Bandit Problems

📅 2025-04-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper studies the $m$-set semi-bandit problem, a special case of the combinatorial semi-bandit framework, with the aim of unifying learning in adversarial and stochastic environments. It analyzes a Follow-the-Perturbed-Leader (FTPL) algorithm based on Fréchet-distributed perturbations and shows that it achieves best-of-both-worlds guarantees for this problem: the optimal $O(\sqrt{nmd})$ regret bound in the adversarial setting and $O(\log n)$ regret in the stochastic setting. Follow-the-Regularized-Leader (FTRL) must solve a computationally expensive convex optimization problem at each step. FTPL instead needs only lightweight perturbations and linear optimization over the action set, which makes it cheaper to run. The analysis extends perturbation-based methods to structured action spaces.

📝 Abstract

We consider a common case of the combinatorial semi-bandit problem, the $m$-set semi-bandit, where the learner selects exactly $m$ arms from a total of $d$ arms. In the adversarial setting, the best regret bound, known to be $\mathcal{O}(\sqrt{nmd})$ for time horizon $n$, is achieved by the well-known Follow-the-Regularized-Leader (FTRL) policy. FTRL, however, must explicitly compute the arm-selection probabilities by solving an optimization problem at each time step and then sample according to them. The Follow-the-Perturbed-Leader (FTPL) policy avoids this: it simply pulls the $m$ arms with the smallest estimated losses after random perturbation. In this paper, we show that FTPL with a Fréchet perturbation also enjoys the optimal regret bound $\mathcal{O}(\sqrt{nmd})$ in the adversarial setting and achieves best-of-both-worlds regret bounds, i.e., it also achieves logarithmic regret in the stochastic setting.
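
The arm-selection step described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the Fréchet shape parameter of 2, the learning rate `eta`, and the convention of subtracting the perturbation scaled by `1 / eta` are all assumptions made for the example. The loss-estimation step that would produce `est_losses` is not shown.

```python
import math
import random


def sample_frechet(rng, shape=2.0):
    """Draw one Frechet(shape) variate by inverse-CDF sampling.

    The CDF is F(x) = exp(-x ** -shape) for x > 0, so
    x = (-ln U) ** (-1 / shape) with U uniform on (0, 1).
    The shape value 2.0 is an assumption for this sketch.
    """
    u = rng.random()
    while u == 0.0:  # random() may return exactly 0.0; avoid log(0)
        u = rng.random()
    return (-math.log(u)) ** (-1.0 / shape)


def ftpl_select(est_losses, m, eta, rng, shape=2.0):
    """Return the sorted indices of the m arms with smallest perturbed loss.

    Each estimated cumulative loss is lowered by an independent
    Frechet perturbation divided by eta (an assumed sign/scale convention).
    """
    perturbed = [
        loss - sample_frechet(rng, shape) / eta for loss in est_losses
    ]
    order = sorted(range(len(est_losses)), key=lambda i: perturbed[i])
    return sorted(order[:m])


# Hypothetical usage: 5 arms, pick m = 2 per round.
rng = random.Random(0)
chosen = ftpl_select([5.0, 1.0, 3.0, 0.5, 4.0], m=2, eta=1.0, rng=rng)
```

No arm-selection probabilities are computed here: each round costs one perturbation draw per arm plus a sort, which is the computational advantage over FTRL that the abstract points to.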

Problem

Research questions and friction points this paper is trying to address.

Achieving optimal adversarial regret for m-set semi-bandit problems

Avoiding explicit probability computation in bandit algorithms

Providing best-of-both-worlds guarantees across adversarial and stochastic settings

Innovation

Methods, ideas, or system contributions that make the work stand out.

FTPL with Fréchet perturbation

Optimal adversarial regret bound

Best-of-both-worlds performance

🔎 Similar Papers

Multi-Player Approaches for Dueling Bandits