🤖 AI Summary
This work investigates how the learning rate governs the regret of policy gradient methods in stochastic multi-armed bandits. By analyzing a continuous-time diffusion approximation of the policy-gradient dynamics, integrating stochastic process theory with regret-bound derivation techniques, the study characterizes the joint influence of the learning rate η and the minimum reward gap Δ on regret. The main contributions are twofold: when η = O(Δ² / log n), the algorithm achieves a regret upper bound of O(k log k log n / η); conversely, the authors construct an instance with only logarithmically many arms on which regret grows linearly unless η = O(Δ²). Together, these results precisely characterize the learning-rate conditions required for sublinear regret.
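As a concrete illustration of the setting, here is a minimal sketch of a softmax policy-gradient (REINFORCE) learner on Bernoulli arms, contrasting a small learning rate η ≈ Δ²/log n with a large one. This discrete-time simulation is only a stand-in for the paper's continuous-time diffusion analysis; the arm means, the horizon, and the `policy_gradient_bandit` helper are illustrative assumptions, not the authors' code.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_bandit(means, eta, horizon, seed=0):
    """Softmax policy gradient on a Bernoulli bandit; returns cumulative pseudo-regret."""
    rng = np.random.default_rng(seed)
    k = len(means)
    theta = np.zeros(k)       # logits of the softmax policy
    best = means.max()
    regret = 0.0
    for _ in range(horizon):
        p = softmax(theta)
        a = rng.choice(k, p=p)
        r = rng.binomial(1, means[a])   # Bernoulli reward for the pulled arm
        # REINFORCE update: grad of log pi(a) w.r.t. theta is (e_a - p)
        grad = -r * p
        grad[a] += r
        theta += eta * grad
        regret += best - means[a]       # expected instantaneous regret
    return regret

if __name__ == "__main__":
    delta = 0.1                         # minimum gap Delta (assumed instance)
    means = np.array([0.5, 0.5 - delta])
    n = 100_000
    for eta in (delta**2 / np.log(n), 1.0):  # small vs. large learning rate
        print(f"eta={eta:.5f}  regret={policy_gradient_bandit(means, eta, n):.1f}")
```

Running the sketch with both settings is meant only to make the η–Δ trade-off tangible: a single run is noisy, and the paper's linear-regret counterexample concerns the diffusion limit rather than any one sample path.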
📝 Abstract
We study a continuous-time diffusion approximation of policy gradient for $k$-armed stochastic bandits. We prove that with a learning rate $\eta = O(\Delta^2/\log(n))$ the regret is $O(k \log(k) \log(n) / \eta)$, where $n$ is the horizon and $\Delta$ is the minimum gap. Moreover, we construct an instance with only logarithmically many arms for which the regret is linear unless $\eta = O(\Delta^2)$.
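One step worth spelling out: substituting the largest admissible learning rate into the upper bound makes the horizon dependence explicit. Taking $\eta = c\,\Delta^2/\log(n)$ for a constant $c$ (a substitution of ours, not a statement from the abstract) gives

$$O\!\left(\frac{k \log(k) \log(n)}{\eta}\right) = O\!\left(\frac{k \log(k) \log^2(n)}{\Delta^2}\right),$$

so under this schedule the upper bound is polylogarithmic in $n$, while the counterexample shows that choosing $\eta$ above the $O(\Delta^2)$ scale forfeits sublinearity entirely.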