🤖 AI Summary
This work addresses the previously unexplored problem of last-iterate convergence rates for Follow-the-Regularized-Leader (FTRL)-type algorithms in stochastic multi-armed bandits. Focusing on the $1/2$-Tsallis-INF algorithm, which enjoys optimal best-of-both-worlds regret guarantees, we analyze how its probability mass concentrates on the optimal arm, measured by the Bregman divergence with respect to the Tsallis entropy regularizer. We rigorously establish that this Bregman divergence decays at rate $t^{-1/2}$, implying last-iterate convergence of the action distribution toward the optimal arm. This is the first result to connect logarithmic regret with last-iterate convergence speed. It provides the first theoretical guarantee on the convergence of the actual decisions of FTRL algorithms under stochastic bandit feedback, thereby filling a fundamental gap in the last-iterate analysis of FTRL methods.
📝 Abstract
The convergence analysis of online learning algorithms is central to machine learning theory, where last-iterate convergence is particularly important, as it captures the learner's actual decisions and describes the evolution of the learning process over time. However, in multi-armed bandits, most existing algorithmic analyses focus on the order of regret, while the last-iterate (simple regret) convergence rate remains less explored, especially for the widely studied Follow-the-Regularized-Leader (FTRL) algorithms. Recently, a growing line of work has established the Best-of-Both-Worlds (BOBW) property of FTRL algorithms in bandit problems, showing in particular that they achieve logarithmic regret in stochastic bandits. Nevertheless, their last-iterate convergence rate has not yet been studied. Intuitively, logarithmic regret should correspond to a $t^{-1}$ last-iterate convergence rate. This paper partially confirms this intuition through theoretical analysis, showing that the Bregman divergence, defined by the regularizer $\Psi(p)=-4\sum_{i=1}^{d}\sqrt{p_i}$ associated with the BOBW FTRL algorithm $1/2$-Tsallis-INF (arXiv:1807.07623), between the point mass on the optimal arm and the probability distribution over the arm set obtained at iteration $t$, decays at a rate of $t^{-1/2}$.
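To make the quantity in the abstract concrete, the sketch below evaluates the Bregman divergence of the regularizer $\Psi(p)=-4\sum_{i=1}^{d}\sqrt{p_i}$ between the point mass on the optimal arm and a distribution $p$ over the arms. It assumes the divergence is taken with the point mass as the first argument, $D_\Psi(e_{i^*}, p)$, and that every $p_i$ is strictly positive; under those assumptions, expanding the definition gives $2\sum_{i}\sqrt{p_i} + 2/\sqrt{p_{i^*}} - 4$. The function names and the example distributions are illustrative only and are not taken from the paper.

```python
import math


def psi(p):
    """Regularizer Psi(p) = -4 * sum_i sqrt(p_i)."""
    return -4.0 * sum(math.sqrt(x) for x in p)


def grad_psi(p):
    """Gradient of Psi: d/dp_i = -2 / sqrt(p_i); requires p_i > 0."""
    return [-2.0 / math.sqrt(x) for x in p]


def bregman_to_point_mass(p, best):
    """D_Psi(e_best, p) computed directly from the Bregman definition.

    Assumption: the point mass e_best is the first argument of the divergence.
    """
    q = [1.0 if i == best else 0.0 for i in range(len(p))]
    inner = sum(g * (qi - pi) for g, qi, pi in zip(grad_psi(p), q, p))
    return psi(q) - psi(p) - inner


def bregman_closed_form(p, best):
    """Same quantity after simplification: 2*sum_i sqrt(p_i) + 2/sqrt(p_best) - 4."""
    return 2.0 * sum(math.sqrt(x) for x in p) + 2.0 / math.sqrt(p[best]) - 4.0


# Hypothetical distributions over three arms, with arm 0 treated as optimal.
spread = [0.7, 0.2, 0.1]
concentrated = [0.98, 0.01, 0.01]
d_spread = bregman_to_point_mass(spread, best=0)
d_concentrated = bregman_to_point_mass(concentrated, best=0)
```

In this example the divergence is smaller for the distribution that places more mass on arm 0, which is the sense in which a decaying divergence reflects concentration on the optimal arm.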