🤖 AI Summary
This work studies the convergence and regret of softmax policy gradient methods for discrete-time stochastic multi-armed bandits. Whereas existing analyses rely on a continuous-time framework, we adapt them to the standard discrete-time setting, using a Lyapunov-style stability analysis to characterize the algorithm's dynamics. By combining policy gradient theory, the softmax parameterization, and stochastic-process techniques, we make explicit the roles of the learning rate and the reward gaps: with a learning rate η = O(Δ_min² / (Δ_max log n)), the algorithm achieves a regret upper bound of O(k log k log n / η), where k denotes the number of arms, n the time horizon, and Δ_min, Δ_max the minimum and maximum suboptimality gaps. To our knowledge, this is the first rigorous regret guarantee for policy gradient methods in the discrete-time bandit setting.
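For concreteness, the softmax parameterization and a standard one-sample discrete-time policy gradient update can be written as follows; this is the textbook form of the update, and the exact estimator analyzed in the paper may differ:

$$
\pi_\theta(a) = \frac{e^{\theta_a}}{\sum_{b=1}^{k} e^{\theta_b}},
\qquad
\theta_{t+1} = \theta_t + \eta\, R_t \bigl( e_{A_t} - \pi_{\theta_t} \bigr),
$$

where $A_t \sim \pi_{\theta_t}$ is the arm played at round $t$, $R_t$ is the observed reward, and $e_{A_t}$ is the standard basis vector for arm $A_t$. The update direction $R_t(e_{A_t} - \pi_{\theta_t})$ is an unbiased estimate of $\nabla_\theta\, \mathbb{E}_{a \sim \pi_\theta}[\mu_a]$, where $\mu_a$ is the mean reward of arm $a$.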
📝 Abstract
We adapt the analysis of policy gradient for continuous-time $k$-armed stochastic bandits by Lattimore (2026) to the standard discrete-time setup. As in continuous time, we prove that with learning rate $\eta = O(\Delta_{\min}^2/(\Delta_{\max} \log(n)))$ the regret is $O(k \log(k) \log(n) / \eta)$, where $n$ is the horizon and $\Delta_{\min}$ and $\Delta_{\max}$ are the minimum and maximum gaps.
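As an illustration, here is a minimal simulation sketch of this setup: softmax policy gradient on a $k$-armed Bernoulli bandit with the learning rate scaled as $\eta = \Delta_{\min}^2/(\Delta_{\max}\log(n))$. The arm means, the constant factor of 1 in $\eta$, and the one-sample REINFORCE estimator are assumptions made for illustration, not necessarily the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example instance: 4 Bernoulli arms (not from the paper).
means = np.array([0.9, 0.6, 0.5, 0.2])
k, n = len(means), 100_000

gaps = means.max() - means                   # suboptimality gaps
delta_min = gaps[gaps > 0].min()
delta_max = gaps.max()
# Learning rate scaled as in the bound; constant factor set to 1 here.
eta = delta_min**2 / (delta_max * np.log(n))

theta = np.zeros(k)                          # softmax logits
regret = 0.0
for t in range(n):
    pi = np.exp(theta - theta.max())         # numerically stable softmax
    pi /= pi.sum()
    a = rng.choice(k, p=pi)                  # sample an arm
    r = rng.binomial(1, means[a])            # Bernoulli reward
    regret += gaps[a]
    # One-sample policy gradient: grad_theta log pi(a) = e_a - pi,
    # so r * (e_a - pi) is an unbiased gradient estimate.
    grad = -r * pi
    grad[a] += r
    theta += eta * grad

print(f"eta = {eta:.4g}, cumulative pseudo-regret = {regret:.1f}")
```

Note that substituting the prescribed $\eta$ into the regret bound gives $O(k \log(k) \Delta_{\max} \log^2(n) / \Delta_{\min}^2)$, so the small step size is the price paid for a guarantee over the whole horizon.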