🤖 AI Summary
This work investigates the algorithmic stability and generalization of momentum stochastic gradient descent (SGDm) under heavy-tailed gradient noise. Methodologically, we first derive its continuous-time limit as a Lévy-driven stochastic differential equation (SDE), then establish quantitative Wasserstein stability bounds for the discrete algorithm. Our analysis reveals that the coupling between momentum and heavy-tailed noise can be harmful to generalization: for quadratic loss functions, SGDm admits a worse generalization bound than standard SGD under such noise. Furthermore, we present the first uniform-in-time discretization error bound for Lévy-driven SDEs with degenerate noise, proving that with appropriately chosen step sizes, the discrete iterates inherit the stability and generalization guarantees of their continuous limit. Our theoretical findings are empirically validated on quadratic loss functions and multilayer neural networks. Collectively, this work provides a theoretical foundation for designing optimization algorithms robust to heavy-tailed gradient noise.
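To make the quadratic-loss experiments the summary mentions concrete, here is a minimal Python sketch, not the paper's actual setup: it runs SGDm on f(x) = ½·xᵀHx with symmetric α-stable gradient noise (heavier tails as α decreases below 2). The step size `eta`, momentum `mu`, tail index `alpha`, matrix `H`, and the Chambers-Mallows-Stuck sampler are all illustrative assumptions; setting `mu=0` recovers plain SGD for comparison.

```python
import numpy as np

def alpha_stable(alpha, size, rng):
    """Symmetric alpha-stable samples via the Chambers-Mallows-Stuck method.

    alpha in (0, 2]: smaller alpha means heavier tails; alpha = 2 is
    Gaussian up to scale, alpha = 1 is Cauchy.
    """
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

def sgdm_quadratic(H, eta=1e-3, mu=0.9, alpha=1.8, steps=50_000, seed=0):
    """Run SGDm on f(x) = 0.5 * x' H x with heavy-tailed gradient noise.

    mu = 0 recovers plain SGD, so the two can be compared directly.
    """
    rng = np.random.default_rng(seed)
    d = H.shape[0]
    x = rng.standard_normal(d)
    v = np.zeros(d)
    for _ in range(steps):
        noise = alpha_stable(alpha, d, rng)   # heavy-tailed gradient noise
        grad = H @ x + noise                  # noisy gradient of the quadratic
        v = mu * v - eta * grad               # heavy-ball momentum update
        x = x + v
    return x

# Example: compare terminal iterates of SGD (mu=0) and SGDm (mu=0.9).
H = np.diag([1.0, 10.0])
x_sgd = sgdm_quadratic(H, mu=0.0)
x_sgdm = sgdm_quadratic(H, mu=0.9)
print(np.linalg.norm(x_sgd), np.linalg.norm(x_sgdm))
```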
📝 Abstract
Understanding the generalization properties of optimization algorithms under heavy-tailed noise has gained growing attention. However, existing theoretical results mainly focus on stochastic gradient descent (SGD), and an analysis of heavy-tailed optimizers beyond SGD is still missing. In this work, we establish generalization bounds for SGD with momentum (SGDm) under heavy-tailed gradient noise. We first consider the continuous-time limit of SGDm, i.e., a Lévy-driven stochastic differential equation (SDE), and establish quantitative Wasserstein algorithmic stability bounds for a class of potentially non-convex loss functions. Our bounds reveal a remarkable phenomenon: for quadratic loss functions, SGDm admits a worse generalization bound in the presence of heavy-tailed noise, indicating that the interaction of momentum and heavy tails can be harmful for generalization. We then extend our analysis to discrete time and develop a uniform-in-time discretization error bound, which, to our knowledge, is the first result of its kind for SDEs with degenerate noise. This result shows that, with appropriately chosen step sizes, the discrete dynamics retain the generalization properties of the limiting SDE. We illustrate our theory on both synthetic quadratic problems and neural networks.
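The abstract does not write out the dynamics, but for orientation, a sketch under standard conventions in this literature might look as follows; the symbols μ, η, γ, σ, and α here are assumed notation, not the paper's own.

```latex
% SGDm recursion with heavy-tailed gradient noise \xi_k (assumed notation):
%   v_{k+1} = \mu v_k - \eta \nabla f(x_k) + \eta \xi_k, \qquad
%   x_{k+1} = x_k + v_{k+1}.
%
% A Lévy-driven SDE limit of the kind the abstract refers to; the noise is
% degenerate because the \alpha-stable process L^{\alpha}_t drives only the
% velocity component, not the position:
\begin{aligned}
  \mathrm{d}V_t &= -\bigl(\gamma V_t + \nabla f(X_t)\bigr)\,\mathrm{d}t
                   + \sigma\,\mathrm{d}L^{\alpha}_t, \\
  \mathrm{d}X_t &= V_t\,\mathrm{d}t .
\end{aligned}
```

The degeneracy of the noise, entering only through V_t, is what makes the uniform-in-time discretization analysis mentioned above nonstandard.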