A Diffusion Analysis of Policy Gradient for Stochastic Bandits

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how the learning rate affects the regret of policy gradient methods in stochastic multi-armed bandits. Developing the first theoretical analysis framework based on continuous-time diffusion approximations—combining stochastic process theory with regret-bound derivation techniques—the study characterizes the joint influence of the learning rate η and the minimum reward gap Δ on regret. The main contributions are: (i) when η = O(Δ² / log n), the algorithm achieves a regret upper bound of O(k log k log n / η); and (ii) a counterexample with only logarithmically many arms showing that regret grows linearly unless η = O(Δ²), thereby precisely characterizing the learning-rate condition necessary for sublinear regret.

📝 Abstract
We study a continuous-time diffusion approximation of policy gradient for $k$-armed stochastic bandits. We prove that with a learning rate $\eta = O(\Delta^2/\log(n))$ the regret is $O(k \log(k) \log(n) / \eta)$ where $n$ is the horizon and $\Delta$ the minimum gap. Moreover, we construct an instance with only logarithmically many arms for which the regret is linear unless $\eta = O(\Delta^2)$.
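To make the setting concrete, here is a minimal softmax policy gradient loop on a k-armed Gaussian bandit. This is an illustrative sketch, not the paper's exact algorithm or analysis: the instance `means`, the unit-variance noise, and the seed are hypothetical, and the step size is simply set to Δ²/log n in the spirit of the paper's condition η = O(Δ²/log n).

```python
import numpy as np

def policy_gradient_bandit(means, n, eta, seed=0):
    """Run softmax policy gradient for n rounds; return cumulative regret.

    Sketch under assumed Gaussian rewards with unit variance; logits are
    updated with the standard REINFORCE gradient at constant step size eta.
    """
    rng = np.random.default_rng(seed)
    k = len(means)
    theta = np.zeros(k)              # softmax logits, one per arm
    best = max(means)
    regret = 0.0
    for _ in range(n):
        # softmax policy (shifted for numerical stability)
        p = np.exp(theta - theta.max())
        p /= p.sum()
        a = rng.choice(k, p=p)
        r = means[a] + rng.standard_normal()   # noisy reward
        # REINFORCE update: grad log pi(a) = e_a - p, scaled by the reward
        grad = -r * p
        grad[a] += r
        theta += eta * grad
        regret += best - means[a]              # pseudo-regret increment
    return regret

# Hypothetical 3-armed instance with minimum gap Delta = 0.5
means = [1.0, 0.5, 0.5]
n = 10_000
eta = 0.5**2 / np.log(n)             # eta = Delta^2 / log(n)
cumulative_regret = policy_gradient_bandit(means, n, eta)
```

The counterexample in the paper suggests that choosing η much larger than Δ² can trap the logits so that a suboptimal arm dominates, which is why the step size above is tied to the gap rather than tuned for fast early progress.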
Problem

Research questions and friction points this paper is trying to address.

policy gradient
stochastic bandits
diffusion approximation
regret analysis
learning rate
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion approximation
policy gradient
stochastic bandits
regret analysis
learning rate