🤖 AI Summary
This work asks whether stochastic gradient descent with momentum (SGDM) can accelerate training without compromising generalization. Existing theory offers few rigorous generalization guarantees for SGDM. To fill this gap, the authors work within the algorithmic stability framework and establish tight on-average model stability bounds for a generalized SGDM scheme that covers both Polyak's and Nesterov's momentum, treating the two mechanisms in a single analysis. The bounds hold for smooth convex losses and any momentum parameter in $[0, 1)$, and they do not require Lipschitz continuity of the loss function. Combining this stability result with optimization error bounds through the standard decomposition of excess risk into generalization and optimization terms, the authors derive optimal excess population risk bounds for SGDM, showing that under these assumptions momentum can accelerate convergence while preserving favorable generalization guarantees.
📝 Abstract
Stochastic gradient descent with momentum (SGDM) is one of the most widely used optimization algorithms in machine learning. While optimization properties of SGDM have been extensively studied in the literature, it remains insufficiently understood whether and when SGDM can generalize well to unseen data. In particular, it has been conjectured that while momentum accelerates training, it may degrade generalization. In this paper, we close this gap by developing a comprehensive generalization analysis of SGDM through the lens of algorithmic stability. More specifically, we introduce a generalized SGDM framework that encompasses both Polyak's and Nesterov's momentum schemes, and establish tight on-average model stability bounds for smooth and convex problems. Notably, the obtained bounds exploit small optimization error bounds along the trajectory, apply to any momentum parameter in the interval $[0, 1)$, and do not require the commonly assumed Lipschitzness of loss functions. We further derive optimization error bounds for the generalized SGDM, and combine them with our generalization analyses to obtain optimal excess population risk bounds for SGDM with both Polyak's and Nesterov's momentum.
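For readers unfamiliar with the two momentum schemes, the following is a minimal sketch of their textbook forms on a least-squares problem, which is smooth and convex. It is not the paper's generalized SGDM framework, whose exact parametrization is not given in the abstract. The function names `sgdm` and `mean_loss`, the `nesterov` flag, the synthetic data, and all hyperparameter values are illustrative assumptions. The only difference between the two schemes here is where the stochastic gradient is evaluated: at the current iterate (Polyak) or at a look-ahead point (Nesterov).

```python
import numpy as np

def sgdm(X, y, lr=0.01, beta=0.9, nesterov=False, steps=2000, seed=0):
    """SGD with momentum on the per-sample loss 0.5 * (x_i @ w - y_i) ** 2.

    Illustrative textbook update, not the paper's generalized framework:
        v <- beta * v - lr * grad(point)
        w <- w + v
    where point = w (Polyak) or point = w + beta * v (Nesterov).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)  # model parameters
    v = np.zeros(d)  # momentum buffer
    for _ in range(steps):
        i = rng.integers(n)  # sample one training example uniformly
        # Choose where the stochastic gradient is evaluated.
        point = w + beta * v if nesterov else w
        grad = (X[i] @ point - y[i]) * X[i]
        v = beta * v - lr * grad
        w = w + v
    return w

def mean_loss(X, y, w):
    """Average least-squares loss over the whole dataset."""
    return 0.5 * np.mean((X @ w - y) ** 2)

# Hypothetical synthetic data: noiseless linear model in 5 dimensions.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 5))
w_true = data_rng.normal(size=5)
y = X @ w_true

w_polyak = sgdm(X, y, nesterov=False)
w_nesterov = sgdm(X, y, nesterov=True)
initial_loss = mean_loss(X, y, np.zeros(5))
```

With `beta=0` the look-ahead point equals the current iterate, so both variants reduce to plain SGD. This matches the abstract's statement that the momentum parameter may be any value in $[0, 1)$, including zero.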