🤖 AI Summary
This work investigates how the learning rate governs the regret of policy gradient methods in stochastic multi-armed bandits. By analyzing a continuous-time diffusion approximation of the policy-gradient dynamics, integrating stochastic process theory with regret-bound derivation techniques, the study characterizes the joint influence of the learning rate η and the minimum reward gap Δ on regret. The main contributions are twofold: when η = O(Δ² / log n), the algorithm achieves a regret upper bound of O(k log k log n / η); conversely, the authors construct an instance with only logarithmically many arms on which regret grows linearly unless η = O(Δ²). Together, these results precisely characterize the learning-rate conditions required for sublinear regret.
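As a concrete illustration of the setting, here is a minimal sketch of a softmax policy-gradient (REINFORCE) learner on Bernoulli arms, contrasting a small learning rate η ≈ Δ²/log n with a large one. This discrete-time simulation is only a stand-in for the paper's continuous-time diffusion analysis; the arm means, the horizon, and the `policy_gradient_bandit` helper are illustrative assumptions, not the authors' code.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_bandit(means, eta, horizon, seed=0):
    """Softmax policy gradient on a Bernoulli bandit; returns cumulative pseudo-regret."""
    rng = np.random.default_rng(seed)
    k = len(means)
    theta = np.zeros(k)       # logits of the softmax policy
    best = means.max()
    regret = 0.0
    for _ in range(horizon):
        p = softmax(theta)
        a = rng.choice(k, p=p)
        r = rng.binomial(1, means[a])   # Bernoulli reward for the pulled arm
        # REINFORCE update: grad of log pi(a) w.r.t. theta is (e_a - p)
        grad = -r * p
        grad[a] += r
        theta += eta * grad
        regret += best - means[a]       # expected instantaneous regret
    return regret

if __name__ == "__main__":
    delta = 0.1                         # minimum gap Delta (assumed instance)
    means = np.array([0.5, 0.5 - delta])
    n = 100_000
    for eta in (delta**2 / np.log(n), 1.0):  # small vs. large learning rate
        print(f"eta={eta:.5f}  regret={policy_gradient_bandit(means, eta, n):.1f}")
```

Running the sketch with both settings is meant only to make the η–Δ trade-off tangible: a single run is noisy, and the paper's linear-regret counterexample concerns the diffusion limit rather than any one sample path.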
📝 Abstract
We study a continuous-time diffusion approximation of policy gradient for $k$-armed stochastic bandits. We prove that with a learning rate $\eta = O(\Delta^2/\log(n))$ the regret is $O(k \log(k) \log(n) / \eta)$, where $n$ is the horizon and $\Delta$ is the minimum gap. Moreover, we construct an instance with only logarithmically many arms for which the regret is linear unless $\eta = O(\Delta^2)$.
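One step worth spelling out: substituting the largest admissible learning rate into the upper bound makes the horizon dependence explicit. Taking $\eta = c\,\Delta^2/\log(n)$ for a constant $c$ (a substitution of ours, not a statement from the abstract) gives

$$O\!\left(\frac{k \log(k) \log(n)}{\eta}\right) = O\!\left(\frac{k \log(k) \log^2(n)}{\Delta^2}\right),$$

so under this schedule the upper bound is polylogarithmic in $n$, while the counterexample shows that choosing $\eta$ above the $O(\Delta^2)$ scale forfeits sublinearity entirely.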