🤖 AI Summary
Existing multi-armed bandit algorithms struggle to simultaneously achieve data dependence, best-of-both-worlds (BOBW) adaptivity, and $T$-optimal worst-case regret guarantees: they either restrict themselves to purely stochastic or purely adversarial environments, or suffer suboptimal worst-case bounds (e.g., $O(\sqrt{T\ln T})$). This paper proposes the Stability-Penalty Matching (SPM) mechanism, integrated into the Follow-the-Regularized-Leader (FTRL) framework, which dynamically adapts the learning rate to structural properties of the data, including sparsity, total variation, and small losses. SPM is the first method to unify all three objectives: it attains an optimal $O(\ln T)$ data-dependent regret in stochastic environments, achieves the BOBW-optimal $O(\sqrt{T})$ worst-case regret in adversarial settings, and thereby breaks the conventional $O(\sqrt{T\ln T})$ barrier. Theoretical analysis establishes the tightness and adaptivity of these bounds across both regimes.
📝 Abstract
Existing data-dependent and best-of-both-worlds regret bounds for multi-armed bandit problems have limited adaptivity: they are either data-dependent but not best-of-both-worlds (BOBW), BOBW but not data-dependent, or carry a sub-optimal $O(\sqrt{T\ln T})$ worst-case guarantee in the adversarial regime. To overcome these limitations, we propose real-time stability-penalty matching (SPM), a new method for obtaining regret bounds that are simultaneously data-dependent, best-of-both-worlds, and $T$-optimal for multi-armed bandit problems. In particular, we show that real-time SPM obtains bounds with worst-case guarantees of order $O(\sqrt{T})$ in the adversarial regime and $O(\ln T)$ in the stochastic regime, while simultaneously being adaptive to data-dependent quantities such as sparsity, variations, and small losses. Our results are obtained by extending the SPM technique for tuning the learning rates in the follow-the-regularized-leader (FTRL) framework, which further indicates that the combination of SPM and FTRL is a promising approach for proving new adaptive bounds in online learning problems.
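The underlying idea of adaptively tuning FTRL's learning rate can be illustrated with a minimal sketch. This is *not* the paper's actual SPM update rule: the negative-entropy regularizer, the importance-weighted loss estimates, the crude per-round stability proxy, and the square-root rate schedule below are all simplifying assumptions made for illustration. The learning rate shrinks as accumulated stability grows, loosely mimicking how SPM keeps the stability and penalty terms of the regret decomposition in balance.

```python
import math
import random

def ftrl_bandit(loss_fn, K, T, seed=0):
    """FTRL on a K-armed bandit with negative-entropy regularization and
    an adaptive learning rate (an illustrative stand-in for SPM tuning).

    loss_fn(t, arm) -> loss in [0, 1]. Returns the total incurred loss.
    """
    rng = random.Random(seed)
    L = [0.0] * K       # cumulative importance-weighted loss estimates
    stability = 1.0     # running sum of per-round stability proxies
    total_loss = 0.0

    for t in range(T):
        # Learning rate decays with accumulated stability (assumption:
        # a simple sqrt schedule, not the paper's SPM matching rule).
        eta = math.sqrt(math.log(K) / stability)

        # FTRL with negative-entropy regularizer = softmax over -eta * L.
        m = min(L)  # shift for numerical stability of exp
        w = [math.exp(-eta * (l - m)) for l in L]
        s = sum(w)
        p = [x / s for x in w]

        # Sample an arm from the distribution p.
        u, acc, arm = rng.random(), 0.0, K - 1
        for i, pi in enumerate(p):
            acc += pi
            if u <= acc:
                arm = i
                break

        loss = loss_fn(t, arm)
        total_loss += loss

        # Unbiased importance-weighted loss estimate for the pulled arm.
        L[arm] += loss / p[arm]

        # Crude stability proxy: sum_i p_i * (hat_ell_i)^2 = loss^2 / p[arm]
        # since the estimate is nonzero only at the pulled arm.
        stability += loss ** 2 / p[arm]

    return total_loss
```

On a simple stochastic instance where one arm always incurs zero loss, this sketch concentrates play on that arm and incurs far less loss than uniform random play, though without any of the data-dependent guarantees the paper proves for actual SPM.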