🤖 AI Summary
This work investigates scaling laws for mini-batch stochastic momentum methods under the power-law random features model, focusing on how data complexity, target complexity, and model dimensionality govern loss-decay dynamics. It introduces dimension-adapted Nesterov acceleration (DANA), a momentum scheduling mechanism that exploits a "momentum outscaling" phenomenon: by scaling the momentum hyperparameters with model dimension and data complexity, DANA improves the power-law exponent of the loss decay. Theoretically, DANA breaks past the scaling-law exponents of standard SGD with momentum (SGD-M), which match those of plain SGD. Empirically, the predicted scaling laws are validated on high-dimensional synthetic quadratic objectives, and DANA consistently accelerates convergence in LSTM-based language modeling while improving compute-optimal scaling. The core contributions are a new approach to momentum scaling and an account of how data and target complexity shape optimization dynamics in overparameterized learning.
📝 Abstract
We investigate scaling laws for stochastic momentum algorithms with small batch sizes on the power-law random features model, parameterized by data complexity, target complexity, and model size. Our analysis reveals four distinct loss-curve shapes, determined by the interplay of data and target complexity, when training with a stochastic momentum algorithm. While traditional stochastic gradient descent with momentum (SGD-M) yields scaling-law exponents identical to SGD's, dimension-adapted Nesterov acceleration (DANA) improves these exponents by scaling momentum hyperparameters with model size and data complexity. DANA achieves this outscaling phenomenon, which also improves compute-optimal scaling behavior, across a broad range of data and target complexities where traditional methods fall short. Extensive experiments on high-dimensional synthetic quadratics validate our theoretical predictions, and large-scale text experiments with LSTMs show that DANA's improved loss exponents over SGD hold in a practical setting.
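To make the central mechanism concrete, the sketch below shows Nesterov-style momentum with a schedule that depends on the model dimension `d` and a data-complexity exponent `alpha`, applied to a quadratic objective. This is only an illustrative sketch under assumed parameter choices: the specific schedule `beta_t = 1 - 1/(1 + t / d**(1/alpha))`, the function `dana_sgd`, and its arguments are hypothetical stand-ins, not the paper's exact DANA update.

```python
import numpy as np

def dana_sgd(grad_fn, w0, d, alpha=2.0, lr=1e-2, steps=500):
    """Gradient descent with a dimension-adapted Nesterov-style momentum schedule.

    Illustrative assumption: the momentum parameter grows toward 1 at a rate
    tied to the model dimension d and a data-complexity exponent alpha.
    This is a sketch of the idea, not the paper's exact DANA rule.
    """
    w = w0.copy()
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        # Momentum ramps toward 1; the d- and alpha-dependence is an assumption.
        beta = 1.0 - 1.0 / (1.0 + t / d ** (1.0 / alpha))
        g = grad_fn(w + beta * v)  # Nesterov look-ahead gradient
        v = beta * v - lr * g
        w = w + v
    return w

# Usage on a toy high-dimensional quadratic 0.5 * w^T A w.
rng = np.random.default_rng(0)
d = 10
A = np.diag(np.linspace(0.1, 1.0, d))  # ill-conditioned diagonal curvature
w_final = dana_sgd(lambda w: A @ w, rng.standard_normal(d), d)
```

The ramping schedule is what distinguishes this from SGD-M with a fixed momentum constant: as the iterate count grows relative to the dimension-dependent timescale, the effective damping shrinks, which is the qualitative behavior behind the outscaling effect described above.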