High-dimensional limit theorems for SGD: Momentum and Adaptive Step-sizes

📅 2025-11-06

📈 Citations: 0

✨ Influential: 0

career value

231K/year

🤖 AI Summary

This paper investigates the limiting dynamics and comparative performance of SGD with momentum (SGD-M) versus adaptive-step-size algorithms in high-dimensional stochastic optimization. Employing high-dimensional asymptotic analysis, stochastic differential equation modeling, and time-scale separation techniques, we derive continuous-time limit models. We establish, for the first time, rigorous equivalence conditions between SGD-M and online SGD in the high-dimensional regime, revealing that momentum amplifies high-dimensional bias. We propose a normalized adaptive step-size mechanism that substantially enlarges the stable convergence step-size regime while accelerating convergence and enhancing robustness—providing theoretical grounding for early preconditioning. Empirical validation on Spiked Tensor PCA and the single-index model confirms that, after proper parameter calibration, SGD-M performs equivalently to standard SGD; in contrast, the normalized adaptive variant achieves superior proximity to the global optimum, with enhanced stability and generalization potential.

Technology Category

Application Category

📝 Abstract

We develop a high-dimensional scaling limit for Stochastic Gradient Descent with Polyak Momentum (SGD-M) and adaptive step-sizes. This provides a framework to rigourously compare online SGD with some of its popular variants. We show that the scaling limits of SGD-M coincide with those of online SGD after an appropriate time rescaling and a specific choice of step-size. However, if the step-size is kept the same between the two algorithms, SGD-M will amplify high-dimensional effects, potentially degrading performance relative to online SGD. We demonstrate our framework on two popular learning problems: Spiked Tensor PCA and Single Index Models. In both cases, we also examine online SGD with an adaptive step-size based on normalized gradients. In the high-dimensional regime, this algorithm yields multiple benefits: its dynamics admit fixed points closer to the population minimum and widens the range of admissible step-sizes for which the iterates converge to such solutions. These examples provide a rigorous account, aligning with empirical motivation, of how early preconditioners can stabilize and improve dynamics in settings where online SGD fails.

Problem

Research questions and friction points this paper is trying to address.

Develops high-dimensional scaling limits for SGD with momentum and adaptive step-sizes

Compares online SGD with variants like momentum and normalized gradients

Examines algorithm performance on Spiked Tensor PCA and Single Index Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

High-dimensional scaling limit for SGD-M

Adaptive step-sizes based on normalized gradients

Early preconditioners stabilize SGD dynamics

🔎 Similar Papers

Convergence of SGD with momentum in the nonconvex case: A novel time window-based analysis