🤖 AI Summary
Existing multi-armed bandit algorithms struggle to simultaneously achieve data dependence, best-of-both-worlds (BOBW) adaptivity, and $T$-optimal worst-case regret guarantees: they either restrict themselves to purely stochastic or purely adversarial environments, or suffer suboptimal worst-case bounds (e.g., $O(\sqrt{T\ln T})$). This paper proposes the Stability-Penalty Matching (SPM) mechanism, integrated into the Follow-the-Regularized-Leader (FTRL) framework, which dynamically adapts the learning rate to structural properties of the data, including sparsity, total variation, and small losses. SPM is the first method to unify all three objectives: it attains an optimal $O(\ln T)$ data-dependent regret in stochastic environments, achieves the BOBW-optimal $O(\sqrt{T})$ worst-case regret in adversarial settings, and thereby breaks the conventional $O(\sqrt{T\ln T})$ barrier. Theoretical analysis establishes the tightness and adaptivity of these bounds across both regimes.
📝 Abstract
Existing data-dependent and best-of-both-worlds regret bounds for multi-armed bandit problems have limited adaptivity: they are either data-dependent but not best-of-both-worlds (BOBW), BOBW but not data-dependent, or carry a sub-optimal $O(\sqrt{T\ln T})$ worst-case guarantee in the adversarial regime. To overcome these limitations, we propose real-time stability-penalty matching (SPM), a new method for obtaining regret bounds that are simultaneously data-dependent, best-of-both-worlds, and $T$-optimal for multi-armed bandit problems. In particular, we show that real-time SPM obtains bounds with worst-case guarantees of order $O(\sqrt{T})$ in the adversarial regime and $O(\ln T)$ in the stochastic regime, while simultaneously being adaptive to data-dependent quantities such as sparsity, variations, and small losses. Our results are obtained by extending the SPM technique for tuning the learning rates in the follow-the-regularized-leader (FTRL) framework, which further indicates that the combination of SPM and FTRL is a promising approach for proving new adaptive bounds in online learning problems.
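The underlying idea of adaptively tuning FTRL's learning rate can be illustrated with a minimal sketch. This is *not* the paper's actual SPM update rule: the negative-entropy regularizer, the importance-weighted loss estimates, the crude per-round stability proxy, and the square-root rate schedule below are all simplifying assumptions made for illustration. The learning rate shrinks as accumulated stability grows, loosely mimicking how SPM keeps the stability and penalty terms of the regret decomposition in balance.

```python
import math
import random

def ftrl_bandit(loss_fn, K, T, seed=0):
    """FTRL on a K-armed bandit with negative-entropy regularization and
    an adaptive learning rate (an illustrative stand-in for SPM tuning).

    loss_fn(t, arm) -> loss in [0, 1]. Returns the total incurred loss.
    """
    rng = random.Random(seed)
    L = [0.0] * K       # cumulative importance-weighted loss estimates
    stability = 1.0     # running sum of per-round stability proxies
    total_loss = 0.0

    for t in range(T):
        # Learning rate decays with accumulated stability (assumption:
        # a simple sqrt schedule, not the paper's SPM matching rule).
        eta = math.sqrt(math.log(K) / stability)

        # FTRL with negative-entropy regularizer = softmax over -eta * L.
        m = min(L)  # shift for numerical stability of exp
        w = [math.exp(-eta * (l - m)) for l in L]
        s = sum(w)
        p = [x / s for x in w]

        # Sample an arm from the distribution p.
        u, acc, arm = rng.random(), 0.0, K - 1
        for i, pi in enumerate(p):
            acc += pi
            if u <= acc:
                arm = i
                break

        loss = loss_fn(t, arm)
        total_loss += loss

        # Unbiased importance-weighted loss estimate for the pulled arm.
        L[arm] += loss / p[arm]

        # Crude stability proxy: sum_i p_i * (hat_ell_i)^2 = loss^2 / p[arm]
        # since the estimate is nonzero only at the pulled arm.
        stability += loss ** 2 / p[arm]

    return total_loss
```

On a simple stochastic instance where one arm always incurs zero loss, this sketch concentrates play on that arm and incurs far less loss than uniform random play, though without any of the data-dependent guarantees the paper proves for actual SPM.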