🤖 AI Summary
This paper challenges the classical square-summability condition (∑αₙ² < ∞) on step sizes in stochastic approximation (SA), proving for the first time that it is not necessary for convergence. Focusing on power-law step sizes αₙ = α₀ n^(−ρ) with ρ ∈ (0,1) and general Markovian noise, it establishes fine-grained characterizations of convergence, bias, and variance. Key contributions are: (1) the first necessary and sufficient condition for vanishing bias when ρ ≤ 1/2; (2) a proof that Polyak–Ruppert averaging achieves the optimal CLT covariance even for ρ ∈ (0,1/2], though bias dominance slows convergence; and (3) almost-sure and Lₚ convergence for all ρ ∈ (0,1), with an explicit characterization of mean-square error degradation to O(αₙ²). Collectively, these results fundamentally reshape the theoretical foundations of SA step-size design.
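For concreteness, here is a sketch of the SA recursion and the Polyak–Ruppert average the summary refers to. The symbols θ, f, and Φ are generic SA notation assumed for illustration, not taken verbatim from the paper:

```latex
% Stochastic approximation with power-law step size (generic notation):
\[
  \theta_{n+1} = \theta_n + \alpha_{n+1}\, f(\theta_n, \Phi_{n+1}),
  \qquad \alpha_n = \alpha_0\, n^{-\rho}, \quad \rho \in (0,1),
\]
% Polyak--Ruppert average of the iterates:
\[
  \bar{\theta}_n = \frac{1}{n} \sum_{k=1}^{n} \theta_k .
\]
```

Note that ∑αₙ² < ∞ holds only for ρ > 1/2, which is why classical theory excludes ρ ≤ 1/2; the paper's results cover the full range ρ ∈ (0,1).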
📝 Abstract
Many machine learning and optimization algorithms are built upon the framework of stochastic approximation (SA), for which the selection of step-size (or learning rate) is essential for success. For the sake of clarity, this paper focuses on the special case $\alpha_n = \alpha_0 n^{-\rho}$ at iteration $n$, with $\rho \in [0,1]$ and $\alpha_0 > 0$ design parameters. It is most common in practice to take $\rho = 0$ (constant step-size), while in more theoretically oriented papers a vanishing step-size is preferred. In particular, with $\rho \in (1/2, 1)$ it is known that on applying the averaging technique of Polyak and Ruppert, the mean-squared error (MSE) converges at the optimal rate of $O(1/n)$ and the covariance in the central limit theorem (CLT) is minimal in a precise sense. The paper revisits step-size selection in a general Markovian setting. Under readily verifiable assumptions, the following conclusions are obtained provided $0 < \rho < 1$:
• Parameter estimates converge with probability one, and also in $L_p$ for any $p \ge 1$.
• The MSE may converge very slowly for small $\rho$, of order $O(\alpha_n^2)$ even with averaging.
• For linear stochastic approximation the source of slow convergence is identified: for any $\rho \in (0,1)$, averaging results in estimates for which the error $\textit{covariance}$ vanishes at the optimal rate, and moreover the CLT covariance is optimal in the sense of Polyak and Ruppert. However, necessary and sufficient conditions are obtained under which the $\textit{bias}$ converges to zero at rate $O(\alpha_n)$.
This is the first paper to obtain such strong conclusions while allowing for $\rho \le 1/2$. A major conclusion is that the choice of $\rho = 0$, or even $\rho < 1/2$, is justified only in select settings: in general, bias may preclude fast convergence.
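The mechanics of power-law step sizes and Polyak–Ruppert averaging can be seen in a minimal sketch. The toy linear problem below (the matrix `A`, `alpha0`, and the i.i.d. Gaussian noise are illustrative assumptions, not the paper's Markovian setting or experiments) runs linear SA for two values of ρ and tracks the running average:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear SA target: find theta* solving A @ theta* = b from noisy
# observations of b - A @ theta. Illustrative choices only.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
theta_star = np.linalg.solve(A, b)

def run_sa(rho, alpha0=0.5, n_iters=100_000):
    """Linear SA with power-law step size alpha_n = alpha0 * n**(-rho),
    plus a running Polyak-Ruppert average of the iterates."""
    theta = np.zeros(2)
    avg = np.zeros(2)
    for n in range(1, n_iters + 1):
        alpha = alpha0 * n ** (-rho)
        noise = rng.normal(size=2)        # i.i.d. noise for simplicity
        obs = b - A @ theta + noise       # noisy observation of b - A @ theta
        theta = theta + alpha * obs       # SA update
        avg += (theta - avg) / n          # running Polyak-Ruppert average
    return theta, avg

for rho in (0.3, 0.7):
    theta, avg = run_sa(rho)
    print(f"rho={rho}: |theta - theta*| = {np.linalg.norm(theta - theta_star):.4f}, "
          f"|avg - theta*| = {np.linalg.norm(avg - theta_star):.4f}")
```

With i.i.d. mean-zero noise the bias phenomenon the paper analyzes (which arises under Markovian noise) does not appear; the sketch only illustrates that the averaged iterate concentrates near $\theta^*$ for both $\rho < 1/2$ and $\rho > 1/2$, consistent with the covariance result stated above.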