🤖 AI Summary
This work challenges the conventional modeling of stochastic gradient descent (SGD) as a Langevin process driven by Brownian noise, an approximation that relies on a continuous-time limit and does not match the discrete SGD update at finite learning rate. Starting from the discrete update, the authors treat SGD as deterministic dynamics in a loss landscape that fluctuates with minibatch sampling, and derive a master equation for the parameter distribution together with a discrete Fokker–Planck equation that differs from the standard Langevin form at order eta^2. Decomposing the dynamics near critical points in the eigenbasis of the mean Hessian, they find qualitatively distinct regimes: nearly flat directions admit no stationary distribution, and their variance grows over time, which corresponds to effective diffusion along valleys with a coefficient proportional to the learning rate. Experiments on computer vision and natural language processing models support these predictions and show a clear qualitative separation between confined and diffusive modes.
📝 Abstract
Stochastic Gradient Descent (SGD) is commonly modeled as a Langevin process, assuming that minibatch noise acts as Brownian motion. However, this approximation relies on a continuous-time limit and a sqrt(eta) noise scaling that does not match the discrete SGD update at finite learning rate. In this work, we propose an alternative formulation of SGD as deterministic dynamics in a fluctuating loss landscape induced by minibatch sampling. Starting directly from the discrete update, we derive a master equation for the parameter distribution and obtain a discrete Fokker–Planck equation that differs from the standard Langevin form at order eta^2. Using this framework, we analyze SGD dynamics near critical points of the loss. We show that the behavior decomposes along the eigenbasis of the mean Hessian into qualitatively distinct regimes. In particular, nearly flat directions do not admit a stationary distribution: the variance grows over time, corresponding to effective diffusion along valleys with a coefficient proportional to the learning rate. We provide empirical evidence supporting these predictions on neural network models in computer vision and natural language processing, observing a clear qualitative separation between confined and diffusive modes.
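The confined and diffusive regimes can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions and is not the authors' model or code: it takes a quadratic loss with one curved direction (curvature h) and one exactly flat direction, and replaces minibatch sampling with additive i.i.d. Gaussian gradient noise of constant standard deviation sigma. The function names and all parameter values are hypothetical. Under these assumptions the discrete update gives a stationary variance eta*sigma^2 / (h*(2 - eta*h)) in the curved direction, which differs from the continuous-time estimate eta*sigma^2 / (2*h) at order eta^2, while the variance in the flat direction after k steps is k*eta^2*sigma^2, that is, linear growth in time t = k*eta with a coefficient proportional to eta.

```python
import numpy as np


def simulate_sgd_modes(eta=0.1, h=1.0, sigma=1.0, steps=2000, runs=4000, seed=0):
    """Iterate x <- x - eta * (curv * x + sigma * xi) for many independent runs.

    Column 0 is a confined direction with curvature h; column 1 is a flat
    direction with zero curvature. Returns the across-run variance per step.
    """
    rng = np.random.default_rng(seed)
    curv = np.array([h, 0.0])
    x = np.zeros((runs, 2))  # every run starts at the critical point
    var_history = np.empty((steps, 2))
    for k in range(steps):
        # Assumption: additive i.i.d. Gaussian noise stands in for minibatch noise.
        xi = rng.standard_normal((runs, 2))
        x = x - eta * (curv * x + sigma * xi)
        var_history[k] = x.var(axis=0)
    return var_history


def predicted_variances(eta, h, sigma, steps):
    """Closed-form variances for the toy recursion above."""
    # Fixed point of V <- (1 - eta*h)**2 * V + eta**2 * sigma**2.
    confined_discrete = eta * sigma**2 / (h * (2.0 - eta * h))
    # Stationary variance of the continuous-time (Langevin) approximation.
    confined_langevin = eta * sigma**2 / (2.0 * h)
    # Flat direction: no fixed point, variance accumulates every step.
    flat_after_steps = steps * eta**2 * sigma**2
    return confined_discrete, confined_langevin, flat_after_steps


if __name__ == "__main__":
    eta, h, sigma, steps = 0.1, 1.0, 1.0, 2000
    var = simulate_sgd_modes(eta, h, sigma, steps)
    discrete, langevin, flat = predicted_variances(eta, h, sigma, steps)
    print("confined (time-averaged):", var[steps // 2:, 0].mean())
    print("discrete / Langevin predictions:", discrete, langevin)
    print("flat at final step:", var[-1, 1], "predicted:", flat)
```

Running the script prints the measured variances next to the closed-form values. The toy keeps the noise additive and state-independent, so it only illustrates the confined versus diffusive split; it does not capture any dependence of the noise on the parameters.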