Adaptive Heavy-Tailed Stochastic Gradient Descent

📅 2025-08-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Deep neural network training often generalizes poorly and gravitates toward sharp minima. Two empirical observations frame this problem: gradient noise in stochastic gradient descent is heavy-tailed, and training passes through the "Edge of Stability" (EoS) phase, in which Hessian curvature grows early in training before plateauing. Method: We propose the first adaptive optimization framework that dynamically modulates the tail behavior of injected gradient noise based on EoS dynamics: heavy-tailed noise early in training enhances global exploration, then the noise progressively transitions to light tails as curvature stabilizes, accelerating convergence to wide, flat minima. The method integrates seamlessly into stochastic gradient descent without additional hyperparameter tuning. Results: Extensive experiments on MNIST, CIFAR-10, and SVHN demonstrate significant improvements over SGD and state-of-the-art noisy optimizers, particularly under poor initialization, along with superior robustness to label noise and enhanced generalization on clean data.
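
To make the noise-injection idea concrete, here is a minimal sketch of a single optimizer step, assuming the injected noise is symmetric α-stable, a family in which α = 2 recovers Gaussian (light-tailed) noise and α < 2 gives progressively heavier tails. The function name, the additive form of the update, and the scaling are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from scipy.stats import levy_stable

def ahtsgd_step(w, grad, lr, alpha, noise_scale, seed=None):
    """One SGD step with injected symmetric alpha-stable noise.

    alpha = 2.0 gives Gaussian (light-tailed) noise; alpha < 2 yields
    heavy tails whose occasional large jumps help escape sharp basins.
    """
    noise = levy_stable.rvs(alpha, 0.0, size=w.shape, random_state=seed)
    return w - lr * grad + noise_scale * noise
```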

📝 Abstract
In the era of large-scale neural network models, optimization algorithms often struggle with generalization due to an overreliance on training loss. One insight widely accepted in the machine learning community is that wide basins (regions around a local minimum where the loss increases gradually) promote better generalization by offering greater stability to small changes in input data or model parameters; sharp minima, in contrast, are typically more sensitive and less stable. Motivated by two key empirical observations: the inherently heavy-tailed distribution of gradient noise in stochastic gradient descent, and the Edge of Stability phenomenon during neural network training, in which curvature grows before settling at a plateau, we introduce Adaptive Heavy-Tailed Stochastic Gradient Descent (AHTSGD). The algorithm injects heavier-tailed noise into the optimizer during the early stages of training to enhance exploration and gradually transitions to lighter-tailed noise as sharpness stabilizes. By dynamically adapting to the sharpness of the loss landscape throughout training, AHTSGD promotes accelerated convergence to wide basins. AHTSGD is the first algorithm to adjust the nature of the noise injected into an optimizer based on the Edge of Stability phenomenon. It consistently outperforms SGD and other noise-based methods on benchmarks such as MNIST and CIFAR-10, with marked gains on noisy datasets such as SVHN. Ultimately, AHTSGD accelerates early training from poor initializations and improves generalization across clean and noisy settings, while remaining robust to learning rate choices.
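
A plausible reading of "gradually transitions to lighter-tailed noise as sharpness stabilizes" is a schedule that maps a sharpness-stability signal to the tail index α used in the step sketched above. The plateau test and linear interpolation below are assumptions for illustration; the paper's actual transition rule may differ.

```python
import numpy as np

def anneal_alpha(sharpness_history, alpha_min=1.5, alpha_max=2.0, window=10):
    """Map the recent stability of a sharpness estimate to a tail index.

    While sharpness is still rising (early Edge-of-Stability phase), stay
    near alpha_min (heavy tails, exploration); once sharpness plateaus,
    move toward alpha_max = 2 (Gaussian noise, fast local convergence).
    """
    if len(sharpness_history) < window:
        return alpha_min  # too early to judge the trend: keep exploring
    recent = np.asarray(sharpness_history[-window:])
    rel_change = (recent.max() - recent.min()) / (abs(recent.mean()) + 1e-12)
    stability = float(np.clip(1.0 - rel_change, 0.0, 1.0))
    return alpha_min + (alpha_max - alpha_min) * stability
```

Each training step would append the latest sharpness estimate to the history and pass the returned α to the noise sampler.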
Problem

Research questions and friction points this paper is trying to address.

Addresses poor generalization in large-scale neural network optimization
Mitigates sensitivity to sharp minima by promoting wide basins
Dynamically adapts noise injection based on loss landscape sharpness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive heavy-tailed noise injection for early exploration
Dynamic noise transition based on sharpness stabilization
First algorithm to adapt injected optimizer noise based on the Edge of Stability phenomenon (see the sharpness-tracking sketch below)
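
The sharpness signal behind the Edge of Stability is typically the largest eigenvalue of the training-loss Hessian, which can be tracked with power iteration on Hessian-vector products; under EoS it grows toward roughly 2 / (learning rate) before plateauing. The finite-difference Hessian-vector product and iteration count below are standard choices assumed for illustration, not details taken from the paper.

```python
import numpy as np

def top_hessian_eigenvalue(grad_fn, w, iters=20, eps=1e-3, rng=None):
    """Estimate the largest Hessian eigenvalue (sharpness) at w.

    grad_fn(w) returns the gradient of the training loss at w; each
    Hessian-vector product is approximated by a central finite
    difference of two gradient evaluations.
    """
    rng = rng or np.random.default_rng()
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)
        lam = float(np.vdot(v, hv))  # Rayleigh quotient with unit v
        norm = np.linalg.norm(hv)
        if norm < 1e-12:
            break  # gradient is locally flat along v
        v = hv / norm
    return lam
```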
Bodu Gong
School of Mathematics and Statistics, University of New South Wales, Sydney, Australia
Gustavo Enrique Batista
Associate Professor, School of Computer Science and Engineering, University of New South Wales
Machine Learning · Data Mining · Quantification · Time Series · Data Streams
Pierre Lafaye de Micheaux
School of Mathematics and Statistics, University of New South Wales, Sydney, Australia