🤖 AI Summary
This paper investigates the limiting dynamics and comparative performance of SGD with momentum (SGD-M) versus adaptive-step-size algorithms in high-dimensional stochastic optimization. Employing high-dimensional asymptotic analysis, stochastic differential equation modeling, and time-scale separation techniques, we derive continuous-time limit models. We establish, for the first time, rigorous equivalence conditions between SGD-M and online SGD in the high-dimensional regime, revealing that momentum amplifies high-dimensional bias. We propose a normalized adaptive step-size mechanism that substantially enlarges the stable convergence step-size regime while accelerating convergence and enhancing robustness—providing theoretical grounding for early preconditioning. Empirical validation on Spiked Tensor PCA and the single-index model confirms that, after proper parameter calibration, SGD-M performs equivalently to standard SGD; in contrast, the normalized adaptive variant achieves superior proximity to the global optimum, with enhanced stability and generalization potential.
📝 Abstract
We develop a high-dimensional scaling limit for Stochastic Gradient Descent with Polyak Momentum (SGD-M) and adaptive step-sizes. This provides a framework to rigourously compare online SGD with some of its popular variants. We show that the scaling limits of SGD-M coincide with those of online SGD after an appropriate time rescaling and a specific choice of step-size. However, if the step-size is kept the same between the two algorithms, SGD-M will amplify high-dimensional effects, potentially degrading performance relative to online SGD. We demonstrate our framework on two popular learning problems: Spiked Tensor PCA and Single Index Models. In both cases, we also examine online SGD with an adaptive step-size based on normalized gradients. In the high-dimensional regime, this algorithm yields multiple benefits: its dynamics admit fixed points closer to the population minimum and widens the range of admissible step-sizes for which the iterates converge to such solutions. These examples provide a rigorous account, aligning with empirical motivation, of how early preconditioners can stabilize and improve dynamics in settings where online SGD fails.