Adaptive Heavy-Tailed Stochastic Gradient Descent

📅 2025-08-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Deep neural network training often generalizes poorly and gravitates toward sharp minima. Two empirical observations frame this problem: gradient noise in stochastic gradient descent is heavy-tailed, and training passes through the "Edge of Stability" (EoS) phase, in which Hessian curvature grows early in training before plateauing. Method: We propose the first adaptive optimization framework that dynamically modulates the tail behavior of injected gradient noise based on EoS dynamics: heavy-tailed noise early in training enhances global exploration, then the noise progressively transitions to light tails as curvature stabilizes, accelerating convergence to wide, flat minima. The method integrates seamlessly into stochastic gradient descent without additional hyperparameter tuning. Results: Extensive experiments on MNIST, CIFAR-10, and SVHN demonstrate significant improvements over SGD and state-of-the-art noisy optimizers, particularly under poor initialization, along with superior robustness to label noise and enhanced generalization on clean data.
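
To make the noise-injection idea concrete, here is a minimal sketch of a single optimizer step, assuming the injected noise is symmetric α-stable, a family in which α = 2 recovers Gaussian (light-tailed) noise and α < 2 gives progressively heavier tails. The function name, the additive form of the update, and the scaling are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from scipy.stats import levy_stable

def ahtsgd_step(w, grad, lr, alpha, noise_scale, seed=None):
    """One SGD step with injected symmetric alpha-stable noise.

    alpha = 2.0 gives Gaussian (light-tailed) noise; alpha < 2 yields
    heavy tails whose occasional large jumps help escape sharp basins.
    """
    noise = levy_stable.rvs(alpha, 0.0, size=w.shape, random_state=seed)
    return w - lr * grad + noise_scale * noise
```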

📝 Abstract
In the era of large-scale neural network models, optimization algorithms often struggle with generalization due to an overreliance on training loss. One insight widely accepted in the machine learning community is that wide basins (regions around a local minimum where the loss increases gradually) promote better generalization by offering greater stability to small changes in input data or model parameters; sharp minima, in contrast, are typically more sensitive and less stable. Motivated by two key empirical observations: the inherently heavy-tailed distribution of gradient noise in stochastic gradient descent, and the Edge of Stability phenomenon during neural network training, in which curvature grows before settling at a plateau, we introduce Adaptive Heavy-Tailed Stochastic Gradient Descent (AHTSGD). The algorithm injects heavier-tailed noise into the optimizer during the early stages of training to enhance exploration and gradually transitions to lighter-tailed noise as sharpness stabilizes. By dynamically adapting to the sharpness of the loss landscape throughout training, AHTSGD promotes accelerated convergence to wide basins. AHTSGD is the first algorithm to adjust the nature of the noise injected into an optimizer based on the Edge of Stability phenomenon. It consistently outperforms SGD and other noise-based methods on benchmarks such as MNIST and CIFAR-10, with marked gains on noisy datasets such as SVHN. Ultimately, AHTSGD accelerates early training from poor initializations and improves generalization across clean and noisy settings, while remaining robust to learning rate choices.
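
A plausible reading of "gradually transitions to lighter-tailed noise as sharpness stabilizes" is a schedule that maps a sharpness-stability signal to the tail index α used in the step sketched above. The plateau test and linear interpolation below are assumptions for illustration; the paper's actual transition rule may differ.

```python
import numpy as np

def anneal_alpha(sharpness_history, alpha_min=1.5, alpha_max=2.0, window=10):
    """Map the recent stability of a sharpness estimate to a tail index.

    While sharpness is still rising (early Edge-of-Stability phase), stay
    near alpha_min (heavy tails, exploration); once sharpness plateaus,
    move toward alpha_max = 2 (Gaussian noise, fast local convergence).
    """
    if len(sharpness_history) < window:
        return alpha_min  # too early to judge the trend: keep exploring
    recent = np.asarray(sharpness_history[-window:])
    rel_change = (recent.max() - recent.min()) / (abs(recent.mean()) + 1e-12)
    stability = float(np.clip(1.0 - rel_change, 0.0, 1.0))
    return alpha_min + (alpha_max - alpha_min) * stability
```

Each training step would append the latest sharpness estimate to the history and pass the returned α to the noise sampler.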
Problem

Research questions and friction points this paper is trying to address.

Addresses poor generalization in large-scale neural network optimization
Mitigates sensitivity to sharp minima by promoting wide basins
Dynamically adapts noise injection based on loss landscape sharpness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive heavy-tailed noise injection for early exploration
Dynamic noise transition based on sharpness stabilization
First algorithm to adapt injected optimizer noise based on the Edge of Stability phenomenon (see the sharpness-tracking sketch below)
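
The sharpness signal behind the Edge of Stability is typically the largest eigenvalue of the training-loss Hessian, which can be tracked with power iteration on Hessian-vector products; under EoS it grows toward roughly 2 / (learning rate) before plateauing. The finite-difference Hessian-vector product and iteration count below are standard choices assumed for illustration, not details taken from the paper.

```python
import numpy as np

def top_hessian_eigenvalue(grad_fn, w, iters=20, eps=1e-3, rng=None):
    """Estimate the largest Hessian eigenvalue (sharpness) at w.

    grad_fn(w) returns the gradient of the training loss at w; each
    Hessian-vector product is approximated by a central finite
    difference of two gradient evaluations.
    """
    rng = rng or np.random.default_rng()
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)
        lam = float(np.vdot(v, hv))  # Rayleigh quotient with unit v
        norm = np.linalg.norm(hv)
        if norm < 1e-12:
            break  # gradient is locally flat along v
        v = hv / norm
    return lam
```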
Bodu Gong
School of Mathematics and Statistics, University of New South Wales, Sydney, Australia
Gustavo Enrique Batista
Associate Professor, School of Computer Science and Engineering, University of New South Wales
Machine Learning · Data Mining · Quantification · Time Series · Data Streams
Pierre Lafaye de Micheaux
School of Mathematics and Statistics, University of New South Wales, Sydney, Australia