🤖 AI Summary
This work studies the convergence and regret of softmax policy gradient methods for discrete-time stochastic multi-armed bandits. Whereas existing analyses rely on a continuous-time framework, we adapt them to the standard discrete-time setting, using a Lyapunov-style stability analysis to characterize the algorithm's dynamics. By combining policy gradient theory, the softmax parameterization, and stochastic-process techniques, we make explicit the roles of the learning rate and the reward gaps: with a learning rate η = O(Δ_min² / (Δ_max log n)), the algorithm achieves a regret upper bound of O(k log k log n / η), where k denotes the number of arms, n the time horizon, and Δ_min, Δ_max the minimum and maximum suboptimality gaps. To our knowledge, this is the first rigorous regret guarantee for policy gradient methods in the discrete-time bandit setting.
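For concreteness, the softmax parameterization and a standard one-sample discrete-time policy gradient update can be written as follows; this is the textbook form of the update, and the exact estimator analyzed in the paper may differ:

$$
\pi_\theta(a) = \frac{e^{\theta_a}}{\sum_{b=1}^{k} e^{\theta_b}},
\qquad
\theta_{t+1} = \theta_t + \eta\, R_t \bigl( e_{A_t} - \pi_{\theta_t} \bigr),
$$

where $A_t \sim \pi_{\theta_t}$ is the arm played at round $t$, $R_t$ is the observed reward, and $e_{A_t}$ is the standard basis vector for arm $A_t$. The update direction $R_t(e_{A_t} - \pi_{\theta_t})$ is an unbiased estimate of $\nabla_\theta\, \mathbb{E}_{a \sim \pi_\theta}[\mu_a]$, where $\mu_a$ is the mean reward of arm $a$.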
📝 Abstract
We adapt the analysis of policy gradient for continuous-time $k$-armed stochastic bandits by Lattimore (2026) to the standard discrete-time setup. As in continuous time, we prove that with learning rate $\eta = O(\Delta_{\min}^2/(\Delta_{\max} \log(n)))$ the regret is $O(k \log(k) \log(n) / \eta)$, where $n$ is the horizon and $\Delta_{\min}$ and $\Delta_{\max}$ are the minimum and maximum gaps.
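As an illustration, here is a minimal simulation sketch of this setup: softmax policy gradient on a $k$-armed Bernoulli bandit with the learning rate scaled as $\eta = \Delta_{\min}^2/(\Delta_{\max}\log(n))$. The arm means, the constant factor of 1 in $\eta$, and the one-sample REINFORCE estimator are assumptions made for illustration, not necessarily the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example instance: 4 Bernoulli arms (not from the paper).
means = np.array([0.9, 0.6, 0.5, 0.2])
k, n = len(means), 100_000

gaps = means.max() - means                   # suboptimality gaps
delta_min = gaps[gaps > 0].min()
delta_max = gaps.max()
# Learning rate scaled as in the bound; constant factor set to 1 here.
eta = delta_min**2 / (delta_max * np.log(n))

theta = np.zeros(k)                          # softmax logits
regret = 0.0
for t in range(n):
    pi = np.exp(theta - theta.max())         # numerically stable softmax
    pi /= pi.sum()
    a = rng.choice(k, p=pi)                  # sample an arm
    r = rng.binomial(1, means[a])            # Bernoulli reward
    regret += gaps[a]
    # One-sample policy gradient: grad_theta log pi(a) = e_a - pi,
    # so r * (e_a - pi) is an unbiased gradient estimate.
    grad = -r * pi
    grad[a] += r
    theta += eta * grad

print(f"eta = {eta:.4g}, cumulative pseudo-regret = {regret:.1f}")
```

Note that substituting the prescribed $\eta$ into the regret bound gives $O(k \log(k) \Delta_{\max} \log^2(n) / \Delta_{\min}^2)$, so the small step size is the price paid for a guarantee over the whole horizon.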