🤖 AI Summary
This work investigates the algorithmic stability and generalization of momentum stochastic gradient descent (SGDm) under heavy-tailed gradient noise. Methodologically, we first derive its continuous-time limit as a Lévy-driven stochastic differential equation (SDE), then establish quantitative Wasserstein stability bounds for the discrete algorithm. Our analysis reveals that the coupling between momentum and heavy-tailed noise can be harmful to generalization: for quadratic loss functions, SGDm admits a worse generalization bound than standard SGD under such noise. Furthermore, we present the first uniform-in-time discretization error bound for Lévy-driven SDEs with degenerate noise, proving that with appropriately chosen step sizes, the discrete iterates inherit the stability and generalization guarantees of their continuous limit. Our theoretical findings are empirically validated on quadratic loss functions and multilayer neural networks. Collectively, this work provides a theoretical foundation for designing optimization algorithms robust to heavy-tailed gradient noise.
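To make the quadratic-loss experiments the summary mentions concrete, here is a minimal Python sketch, not the paper's actual setup: it runs SGDm on f(x) = ½·xᵀHx with symmetric α-stable gradient noise (heavier tails as α decreases below 2). The step size `eta`, momentum `mu`, tail index `alpha`, matrix `H`, and the Chambers-Mallows-Stuck sampler are all illustrative assumptions; setting `mu=0` recovers plain SGD for comparison.

```python
import numpy as np

def alpha_stable(alpha, size, rng):
    """Symmetric alpha-stable samples via the Chambers-Mallows-Stuck method.

    alpha in (0, 2]: smaller alpha means heavier tails; alpha = 2 is
    Gaussian up to scale, alpha = 1 is Cauchy.
    """
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

def sgdm_quadratic(H, eta=1e-3, mu=0.9, alpha=1.8, steps=50_000, seed=0):
    """Run SGDm on f(x) = 0.5 * x' H x with heavy-tailed gradient noise.

    mu = 0 recovers plain SGD, so the two can be compared directly.
    """
    rng = np.random.default_rng(seed)
    d = H.shape[0]
    x = rng.standard_normal(d)
    v = np.zeros(d)
    for _ in range(steps):
        noise = alpha_stable(alpha, d, rng)   # heavy-tailed gradient noise
        grad = H @ x + noise                  # noisy gradient of the quadratic
        v = mu * v - eta * grad               # heavy-ball momentum update
        x = x + v
    return x

# Example: compare terminal iterates of SGD (mu=0) and SGDm (mu=0.9).
H = np.diag([1.0, 10.0])
x_sgd = sgdm_quadratic(H, mu=0.0)
x_sgdm = sgdm_quadratic(H, mu=0.9)
print(np.linalg.norm(x_sgd), np.linalg.norm(x_sgdm))
```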
📝 Abstract
Understanding the generalization properties of optimization algorithms under heavy-tailed noise has gained growing attention. However, existing theoretical results mainly focus on stochastic gradient descent (SGD), and an analysis of heavy-tailed optimizers beyond SGD is still missing. In this work, we establish generalization bounds for SGD with momentum (SGDm) under heavy-tailed gradient noise. We first consider the continuous-time limit of SGDm, i.e., a Lévy-driven stochastic differential equation (SDE), and establish quantitative Wasserstein algorithmic stability bounds for a class of potentially non-convex loss functions. Our bounds reveal a remarkable phenomenon: for quadratic loss functions, SGDm admits a worse generalization bound in the presence of heavy-tailed noise, indicating that the interaction of momentum and heavy tails can be harmful for generalization. We then extend our analysis to discrete time and develop a uniform-in-time discretization error bound, which, to our knowledge, is the first result of its kind for SDEs with degenerate noise. This result shows that, with appropriately chosen step sizes, the discrete dynamics retain the generalization properties of the limiting SDE. We illustrate our theory on both synthetic quadratic problems and neural networks.
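The abstract does not write out the dynamics, but for orientation, a sketch under standard conventions in this literature might look as follows; the symbols μ, η, γ, σ, and α here are assumed notation, not the paper's own.

```latex
% SGDm recursion with heavy-tailed gradient noise \xi_k (assumed notation):
%   v_{k+1} = \mu v_k - \eta \nabla f(x_k) + \eta \xi_k, \qquad
%   x_{k+1} = x_k + v_{k+1}.
%
% A Lévy-driven SDE limit of the kind the abstract refers to; the noise is
% degenerate because the \alpha-stable process L^{\alpha}_t drives only the
% velocity component, not the position:
\begin{aligned}
  \mathrm{d}V_t &= -\bigl(\gamma V_t + \nabla f(X_t)\bigr)\,\mathrm{d}t
                   + \sigma\,\mathrm{d}L^{\alpha}_t, \\
  \mathrm{d}X_t &= V_t\,\mathrm{d}t .
\end{aligned}
```

The degeneracy of the noise, entering only through V_t, is what makes the uniform-in-time discretization analysis mentioned above nonstandard.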