🤖 AI Summary
Constant-step-size SGD and its Ruppert–Polyak averaged variant (ASGD) lack rigorous statistical guarantees in high-dimensional settings. Method: The paper's key innovation is to transfer tools from high-dimensional time series analysis to online optimization, modeling constant-step-size SGD as a nonlinear autoregressive process. By adapting coupling techniques, it establishes geometric moment contraction of the iterate sequence, which yields asymptotic stationarity, and it combines this with high-dimensional concentration inequalities. Contribution/Results: The work delivers q-th moment convergence guarantees (for any q ≥ 2) for constant-step-size SGD and ASGD in general ℓ^s-norms, including the ℓ^∞-norm commonly used for high-dimensional sparse or structured models, and derives sharp high-probability bounds for high-dimensional ASGD. These results close a critical theoretical gap in the high-dimensional statistical analysis of constant-step-size SGD and ASGD, providing a rigorous foundation for large-scale online learning.
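As a rough sketch of the setup (the notation here is illustrative, not taken from the paper), constant-step-size SGD can be written as an iterated random function, i.e., a nonlinear autoregressive process driven by i.i.d. samples, and ASGD simply averages its trajectory:

$$
x_k \;=\; x_{k-1} - \eta\,\nabla f(x_{k-1}, \xi_k) \;=:\; G_{\eta}(x_{k-1}, \xi_k),
\qquad
\bar{x}_n \;=\; \frac{1}{n}\sum_{k=1}^{n} x_k,
$$

where $\eta > 0$ is the constant learning rate, $\xi_k$ are i.i.d. data points, and $\nabla f(\cdot, \xi_k)$ is a stochastic gradient. Viewing $(x_k)$ as a Markov chain generated by the random map $G_{\eta}(\cdot, \xi_k)$ is what makes time-series tools such as coupling applicable.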
📝 Abstract
Stochastic Gradient Descent (SGD) and its Ruppert-Polyak averaged variant (ASGD) lie at the heart of modern large-scale learning, yet their theoretical properties in high-dimensional settings remain poorly understood. In this paper, we provide rigorous statistical guarantees for constant learning-rate SGD and ASGD in high-dimensional regimes. Our key innovation is to transfer powerful tools from high-dimensional time series to online learning. Specifically, by viewing SGD as a nonlinear autoregressive process and adapting existing coupling techniques, we prove the geometric-moment contraction of high-dimensional SGD for constant learning rates, thereby establishing the asymptotic stationarity of the iterates. Building on this, we derive the $q$-th moment convergence of SGD and ASGD for any $q \ge 2$ in general $\ell^s$-norms, and, in particular, the $\ell^{\infty}$-norm that is frequently adopted in high-dimensional sparse or structured models. Furthermore, we provide a sharp high-probability concentration analysis, which entails probabilistic bounds for high-dimensional ASGD. Beyond closing a critical gap in SGD theory, our proposed framework offers a novel toolkit for analyzing a broad class of high-dimensional learning algorithms.
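For context, the geometric-moment contraction invoked above is, in its standard form (an illustrative statement under generic assumptions, not quoted from the paper), a coupling condition: two copies of the chain started from different points $x_0$ and $x_0'$ but driven by the same samples contract geometrically in the $q$-th moment,

$$
\bigl(\mathbb{E}\,\|x_k(x_0) - x_k(x_0')\|_{s}^{q}\bigr)^{1/q}
\;\le\; C\,\rho^{\,k}\,\|x_0 - x_0'\|_{s},
\qquad \rho \in (0,1),\; k \ge 1,
$$

for some constant $C > 0$. Such a contraction forces the chain to forget its initialization, which is what underlies the existence of a unique stationary distribution and the asymptotic stationarity and moment convergence results described in the abstract.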