🤖 AI Summary
This work addresses theoretical limitations in the asymptotic convergence analysis of stochastic gradient descent (SGD) for nonconvex stochastic optimization. Classical analyses rely on the Robbins–Monro step-size condition (e.g., ∑εₜ² < ∞) and strong regularity assumptions—namely, globally Lipschitz continuous gradients and bounded higher-order moments. We propose a novel analytical framework grounded in stopping-time arguments and martingale convergence theory. For the first time, under the significantly relaxed step-size conditions ∑εₜ = ∞ and ∑εₜᵖ < ∞ for some p > 2, we rigorously establish almost-sure convergence of the SGD iterates to critical points and derive an associated L₂ convergence rate. Crucially, our analysis dispenses with the global Lipschitz gradient assumption, substantially broadening the applicability of SGD convergence theory. Moreover, it accommodates practical step-size schedules—including constant and polynomially decaying steps—enhancing alignment with empirical training practices.
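To make the relaxed step-size condition concrete, here is a small numeric sketch (my own illustration, not from the paper): the polynomial schedule εₜ = t^(−0.4) satisfies the paper's conditions with p = 3, since ∑ t^(−0.4) diverges while ∑ t^(−1.2) converges, yet it violates the classical Robbins–Monro requirement because ∑ t^(−0.8) also diverges.

```python
def partial_sums(q: float, n: int) -> float:
    """Partial sum of sum_{t=1}^{n} t^(-q)."""
    return sum(t ** (-q) for t in range(1, n + 1))

# Schedule eps_t = t^(-0.4), checked against the two conditions:
#   sum eps_t   = sum t^(-0.4) -> diverges   (needed: = infinity)
#   sum eps_t^3 = sum t^(-1.2) -> converges  (relaxed condition with p = 3)
#   sum eps_t^2 = sum t^(-0.8) -> diverges   (so Robbins-Monro fails)
s1_small, s1_large = partial_sums(0.4, 10_000), partial_sums(0.4, 100_000)
s3_small, s3_large = partial_sums(1.2, 10_000), partial_sums(1.2, 100_000)

print(f"sum t^-0.4 up to 1e4 vs 1e5: {s1_small:.1f} -> {s1_large:.1f}  (keeps growing)")
print(f"sum t^-1.2 up to 1e4 vs 1e5: {s3_small:.4f} -> {s3_large:.4f}  (levels off)")
```

Ten-fold more terms add over a thousand to the first sum but only a few tenths to the second, which is the qualitative gap the conditions formalize.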
📝 Abstract
Stochastic Gradient Descent (SGD) is widely used in machine learning research. Previous convergence analyses of SGD under the vanishing step-size setting typically require Robbins–Monro conditions. However, in practice, a wider variety of step-size schemes are frequently employed, yet existing convergence results remain limited and often rely on strong assumptions. This paper bridges this gap by introducing a novel analytical framework based on a stopping-time method, enabling asymptotic convergence analysis of SGD under more relaxed step-size conditions and weaker assumptions. In the non-convex setting, we prove the almost sure convergence of SGD iterates for step-sizes $\{\epsilon_t\}_{t \geq 1}$ satisfying $\sum_{t=1}^{+\infty} \epsilon_t = +\infty$ and $\sum_{t=1}^{+\infty} \epsilon_t^p < +\infty$ for some $p > 2$. Compared with previous studies, our analysis eliminates the global Lipschitz continuity assumption on the loss function and relaxes the boundedness requirements for higher-order moments of stochastic gradients. Building upon the almost sure convergence results, we further establish $L_2$ convergence. These significantly relaxed assumptions make our theoretical results more general, thereby enhancing their applicability in practical scenarios.
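A minimal sketch of the setting being analyzed (my own toy example, not the paper's experiment): SGD with unbiased noisy gradients on a simple nonconvex loss, run with a polynomially decaying schedule that is covered by the relaxed condition but not by Robbins–Monro. The loss, noise model, and constants below are all illustrative choices.

```python
import random

random.seed(0)

def grad(x: float) -> float:
    # Gradient of the nonconvex loss f(x) = (x^2 - 1)^2,
    # whose critical points are x = -1, 0, and 1.
    return 4.0 * x * (x * x - 1.0)

x = 3.0  # arbitrary starting point
for t in range(1, 200_001):
    # eps_t = c * t^(-0.4): sum eps_t diverges, sum eps_t^3 converges,
    # but sum eps_t^2 diverges, so Robbins-Monro does not cover it.
    eps_t = 0.01 * t ** (-0.4)
    g = grad(x) + random.gauss(0.0, 1.0)  # unbiased stochastic gradient
    x -= eps_t * g

print(f"final iterate: {x:.3f}, gradient there: {grad(x):.4f}")
```

The iterate settles near a critical point (here the basin around x = 1), matching the kind of almost-sure convergence to critical points the paper establishes for this step-size family.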