🤖 AI Summary
This paper investigates the global convergence of stochastic gradient multi-armed bandit (SG-Bandit) algorithms under arbitrary constant learning rates, focusing on non-ideal settings where standard smoothness and bounded-noise assumptions fail. Methodologically, it integrates tools from stochastic optimization, bandit theory, and probabilistic convergence analysis. The key contribution is the first rigorous proof that SG-Bandit converges almost surely to the globally optimal policy, even without smoothness, under non-stationary noise, and with only weak noise control, thereby eliminating reliance on learning-rate decay or strong regularity conditions. Crucially, the analysis uncovers an intrinsic balance between action sampling rates and the cumulative progress-to-noise ratio, which governs convergence behavior. This result substantially extends the theoretical applicability of stochastic gradient methods to high-noise, nonsmooth bandit environments.
📝 Abstract
We provide a new understanding of the stochastic gradient bandit algorithm by showing that it converges to a globally optimal policy almost surely using *any* constant learning rate. This result demonstrates that the stochastic gradient algorithm continues to balance exploration and exploitation appropriately even in scenarios where standard smoothness and noise control assumptions break down. The proofs are based on novel findings about action sampling rates and the relationship between cumulative progress and noise, and extend the current understanding of how simple stochastic gradient methods behave in bandit settings.
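For intuition, the algorithm under discussion can be illustrated with a minimal sketch. This is not the paper's code; it is a standard softmax-parameterized stochastic gradient bandit with a constant learning rate `eta`, where each update uses the unbiased REINFORCE-style gradient estimate from a single sampled action. The arm means, noise level, and step count below are illustrative assumptions.

```python
import numpy as np

def softmax(theta):
    # Numerically stable softmax over the logits theta.
    z = theta - theta.max()
    e = np.exp(z)
    return e / e.sum()

def sg_bandit(true_means, eta=1.0, steps=5000, seed=0):
    """Run a softmax stochastic gradient bandit with a CONSTANT learning rate.

    true_means : mean reward of each arm (illustrative values below)
    eta        : constant learning rate (no decay schedule)
    """
    rng = np.random.default_rng(seed)
    K = len(true_means)
    theta = np.zeros(K)  # policy logits, pi = softmax(theta)
    for _ in range(steps):
        pi = softmax(theta)
        a = rng.choice(K, p=pi)                 # sample an action from the policy
        r = true_means[a] + rng.normal(0, 0.1)  # noisy reward observation
        # Unbiased stochastic estimate of the policy gradient:
        # grad_theta E[r] estimated by r * (one_hot(a) - pi).
        grad = r * (np.eye(K)[a] - pi)
        theta += eta * grad                     # constant-step-size update
    return softmax(theta)

# With a clear gap between arm means, the policy concentrates on the best arm.
pi = sg_bandit(np.array([0.2, 0.5, 0.9]))
```

With a sufficiently large gap between arm means, the returned policy places most of its probability on the best arm even though `eta` is never decayed, which is the behavior the paper's almost-sure convergence result formalizes.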