🤖 AI Summary
This paper studies the stochastic multi-armed bandit problem under adversarial corruption, where an adversary may partially manipulate the random rewards of arms to mislead the learning algorithm. To handle unknown corruption magnitude, we propose SAMBA—a computationally efficient policy-gradient-based algorithm that integrates adaptive learning rates with corruption-robust reward estimation. Theoretically, SAMBA achieves an asymptotically optimal regret bound of $O(K\log T/\Delta) + O(C/\Delta)$, eliminating the redundant $\log T$ factor present in prior state-of-the-art efficient algorithms—marking the first such result for polynomial-time methods. Empirically, SAMBA significantly outperforms baselines including CBARBAR under high corruption levels, demonstrating superior stability and convergence.
📝 Abstract
In this paper, we consider the stochastic multi-armed bandit problem with adversarial corruptions, where the random rewards of the arms are partially modified by an adversary to fool the algorithm. We apply the policy gradient algorithm SAMBA to this setting, and show that it is computationally efficient and achieves a state-of-the-art $O(K\log T/\Delta) + O(C/\Delta)$ regret upper bound, where $K$ is the number of arms, $C$ is the unknown corruption level, $\Delta$ is the minimum expected reward gap between the best arm and the other arms, and $T$ is the time horizon. Compared with the best existing efficient algorithm, CBARBAR, whose regret upper bound is $O(K\log^2 T/\Delta) + O(C)$, SAMBA removes one $\log T$ factor from the regret bound while keeping the corruption-dependent term linear in $C$. This is indeed asymptotically optimal. We also conduct simulations to demonstrate the effectiveness of SAMBA, and the results show that SAMBA outperforms existing baselines.
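To make the policy-gradient setting concrete, here is a minimal sketch of a generic softmax policy-gradient bandit loop on Bernoulli arms. This is an illustrative toy, not the paper's SAMBA algorithm: the learning rate, baseline, and reward model are assumptions for the example, and it includes no corruption-robust estimation.

```python
import math
import random

def policy_gradient_bandit(means, T, lr=0.1, seed=0):
    """Toy softmax policy-gradient bandit (illustrative sketch only).

    means: true Bernoulli success probability of each arm (assumed for the demo)
    T: time horizon; returns the average reward collected over T rounds.
    """
    rng = random.Random(seed)
    K = len(means)
    prefs = [0.0] * K      # one preference (logit) per arm
    baseline = 0.0         # running-average reward, used as a variance-reducing baseline
    total = 0.0
    for t in range(1, T + 1):
        # Softmax policy over preferences (shifted by max for numerical stability).
        m = max(prefs)
        exps = [math.exp(p - m) for p in prefs]
        s = sum(exps)
        probs = [e / s for e in exps]
        # Sample an arm and observe a Bernoulli reward.
        a = rng.choices(range(K), weights=probs)[0]
        r = 1.0 if rng.random() < means[a] else 0.0
        total += r
        # REINFORCE-style gradient step on the preferences.
        for i in range(K):
            grad = (1.0 if i == a else 0.0) - probs[i]
            prefs[i] += lr * (r - baseline) * grad
        baseline += (r - baseline) / t
    return total / T

# Usage: the policy should concentrate on the 0.9 arm over time.
avg_reward = policy_gradient_bandit([0.2, 0.5, 0.9], T=5000)
```

An adversary in the corrupted setting would perturb the observed `r` on some rounds; SAMBA's contribution is handling such corruptions of unknown total magnitude $C$ while keeping the regret bound above.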