🤖 AI Summary
This work addresses the poorly understood theory of classical momentum methods, such as Polyak’s heavy ball and Nesterov acceleration, in stochastic mini-batch optimization, where prior analyses required strong noise assumptions and extremely large batch sizes. For quadratic objectives in the interpolation regime, the paper establishes a unified theoretical framework showing that momentum-induced acceleration scales linearly with batch size up to a natural saturation point, thereby enabling perfect parallelization of mini-batch computation. The framework also yields a simple choice of momentum parameter and proves acceleration for arbitrary batch sizes under minimal noise assumptions. Experiments confirm that this parameter choice is effective in practice.
📝 Abstract
Accelerating stochastic gradient methods with classical momentum schemes, such as Polyak's heavy ball, has proven highly successful in training large-scale machine learning models, particularly when combined with the hardware acceleration of large mini-batch computations. Yet, the effect of classical momentum on stochastic mini-batch optimization has been poorly understood theoretically, with prior works requiring strong noise assumptions and extremely large mini-batches. In this work, we develop a general theory of stochastic momentum acceleration for optimizing over quadratics in the interpolation regime, a popular abstraction for studying deep learning dynamics which also includes classical methods such as randomized Kaczmarz and coordinate descent. Our framework encompasses both heavy ball and Nesterov-style momentum, allows for arbitrary mini-batch sizes, and makes minimal assumptions on the stochastic noise. In particular, we show that acceleration from classical momentum is directly proportional to the gradient mini-batch size (up to a natural saturation point), thereby enabling perfect parallelization of mini-batch computations. Our theory also provides a simple choice for the momentum parameter, which is shown to be effective empirically.
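To make the setting concrete, here is a minimal sketch of mini-batch SGD with heavy-ball momentum on a consistent least-squares problem, which is a quadratic objective where interpolation holds because an exact solution exists. The function name `minibatch_heavy_ball`, the problem dimensions, and the step size and momentum values are illustrative assumptions; in particular, the momentum value is a hand-picked constant and not the parameter choice derived in the paper.

```python
import numpy as np


def minibatch_heavy_ball(A, b, batch_size, step, momentum, iters, seed=0):
    """Mini-batch SGD with heavy-ball momentum on 0.5 * mean((A x - b)**2).

    Hypothetical helper for illustration only; step and momentum are
    supplied by the caller and are not tuned according to any theory.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    x_prev = x.copy()
    for _ in range(iters):
        # Sample a mini-batch of rows without replacement.
        idx = rng.choice(n, size=batch_size, replace=False)
        A_batch, b_batch = A[idx], b[idx]
        # Stochastic gradient of the least-squares loss on the batch.
        grad = A_batch.T @ (A_batch @ x - b_batch) / batch_size
        # Heavy-ball update: gradient step plus a multiple of the last move.
        x_next = x - step * grad + momentum * (x - x_prev)
        x_prev, x = x, x_next
    return x


# Consistent linear system: b = A x_star exactly, so every mini-batch
# gradient vanishes at x_star (the interpolation regime).
rng = np.random.default_rng(0)
n, d = 200, 20
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star

x_hat = minibatch_heavy_ball(
    A, b, batch_size=20, step=0.02, momentum=0.5, iters=2000
)
err = np.linalg.norm(x_hat - x_star)
print(f"distance to solution: {err:.3e}")
```

Because the system is consistent, the gradient noise shrinks as the iterate approaches the solution, so the iterates can converge to `x_star` with a constant step size. Varying `batch_size` and `momentum` in this sketch is one informal way to explore the batch-size dependence the abstract describes, though it does not reproduce the paper's analysis.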