Transient learning dynamics drive escape from sharp valleys in Stochastic Gradient Descent

📅 2026-01-16
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the unresolved dynamical mechanism underlying stochastic gradient descent’s (SGD) preference for flat minima that generalize better. By combining analytically tractable physical models with numerical experiments, the authors demonstrate that SGD undergoes a brief exploratory phase early in training, during which its intrinsic noise effectively reshapes the loss landscape into an “effective potential” that steers trajectories away from sharp minima and toward flatter regions. In later stages, rising energy barriers freeze the dynamics within a single basin. The authors propose a nonequilibrium perspective on SGD’s solution selection, clarifying the role of transient freezing and the critical influence of noise magnitude in converging to flat, generalizable solutions. This framework unifies learning dynamics, loss-landscape geometry, and generalization performance, offering theoretical grounding for the design of optimization algorithms with better generalization.
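
To make the escape-and-freezing picture concrete, here is a minimal, self-contained sketch (an illustration, not the authors' model): Langevin-style gradient descent on a toy one-dimensional loss with a sharp valley and an equally deep flat valley, where additive noise of strength `T` is a crude stand-in for SGD's minibatch noise. At small `T` the dynamics stay frozen in the sharp valley where they start; raising `T` lets trajectories hop out, and a growing fraction settles in the flat valley, mirroring the noise dependence described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D loss with two equal-depth valleys: a sharp one at x = -1
# (width 0.15) and a flat one at x = +1 (width 0.60), plus a weak
# quartic term that keeps trajectories bounded.
S_SHARP, S_FLAT = 0.15, 0.60

def loss(x):
    return (-np.exp(-(x + 1.0) ** 2 / (2 * S_SHARP ** 2))
            - np.exp(-(x - 1.0) ** 2 / (2 * S_FLAT ** 2))
            + 0.01 * x ** 4)

def grad(x, h=1e-4):
    # Central finite difference; an analytic gradient works just as well.
    return (loss(x + h) - loss(x - h)) / (2.0 * h)

def run(temperature, lr=0.01, steps=20_000, n_runs=300):
    """Langevin-style gradient descent: additive noise of strength
    `temperature` stands in for SGD's minibatch gradient noise."""
    x = np.full(n_runs, -1.0)               # every run starts in the SHARP valley
    kick = np.sqrt(2.0 * lr * temperature)  # standard Langevin noise scale
    for _ in range(steps):
        x = x - lr * grad(x) + kick * rng.standard_normal(n_runs)
    return x

for T in (0.05, 0.10, 0.20):
    finals = run(T)
    print(f"noise T={T:.2f}: {np.mean(finals > 0.0):.0%} of runs end in the flat valley")
```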

📝 Abstract
Stochastic gradient descent (SGD) is central to deep learning, yet the dynamical origin of its preference for flatter, more generalizable solutions remains unclear. Here, by analyzing SGD learning dynamics, we identify a nonequilibrium mechanism governing solution selection. Numerical experiments reveal a transient exploratory phase in which SGD trajectories repeatedly escape sharp valleys and transition toward flatter regions of the loss landscape. By using a tractable physical model, we show that the SGD noise reshapes the landscape into an effective potential that favors flat solutions. Crucially, we uncover a transient freezing mechanism: as training proceeds, growing energy barriers suppress inter-valley transitions and ultimately trap the dynamics within a single basin. Increasing the SGD noise strength delays this freezing, which enhances convergence to flatter minima. Together, these results provide a unified physical framework linking learning dynamics, loss-landscape geometry, and generalization, and suggest principles for the design of more effective optimization algorithms.
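
One textbook way to formalize the “effective potential” idea (a one-dimensional sketch under assumed dynamics, not necessarily the paper’s derivation) is to model SGD as overdamped Itô–Langevin dynamics on the loss $V(\theta)$ with a parameter-dependent diffusion coefficient $D(\theta)$ set by the minibatch noise. The zero-flux stationary solution of the corresponding Fokker–Planck equation, $V'P + \partial_\theta(DP) = 0$, then reads:

```latex
% Hedged 1-D sketch: D(theta) is an assumed, state-dependent diffusion
% coefficient standing in for SGD's minibatch noise.
\begin{align}
  d\theta &= -V'(\theta)\,dt + \sqrt{2 D(\theta)}\, dW_t ,\\
  P_{\mathrm{ss}}(\theta) &\propto \frac{1}{D(\theta)}
      \exp\!\left(-\int^{\theta} \frac{V'(u)}{D(u)}\, du\right)
      \equiv e^{-V_{\mathrm{eff}}(\theta)} ,\\
  V_{\mathrm{eff}}(\theta) &= \int^{\theta} \frac{V'(u)}{D(u)}\, du
      + \ln D(\theta) .
\end{align}
```

If $D(\theta)$ is larger in sharp, high-curvature regions, as SGD’s gradient noise typically is, both terms of $V_{\mathrm{eff}}$ suppress sharp valleys relative to the bare loss, which is one concrete sense in which noise “reshapes” the landscape.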
Problem

Research questions and friction points this paper is trying to address.

stochastic gradient descent
loss landscape
flat minima
learning dynamics
generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

transient dynamics
SGD noise
loss landscape
flat minima
nonequilibrium mechanism
Ning Yang
Peking University Chengdu Academy for Advanced Interdisciplinary Biotechnologies, Chengdu 610213, China
Yikuan Zhang
School of Physics, Peking University, Beijing 100871, China
Ouyang Qi
Institute for Advanced Study in Physics, Zhejiang University, Hangzhou 310058, China
Chao Tang
Peking University Chengdu Academy for Advanced Interdisciplinary Biotechnologies, Chengdu 610213, China; Center for Quantitative Biology, Peking University, Beijing 100871, China
Yuhai Tu
Senior Research Scientist, Flatiron Institute
Statistical Physics · Biophysics · Systems Biology