Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics

📅 2026-05-21

🤖 AI Summary
This work challenges the conventional modeling of stochastic gradient descent (SGD) as Brownian motion, which relies on a continuous-time limit and does not match the discrete SGD update at finite learning rate. Starting from the discrete update, the authors treat SGD as deterministic dynamics in a loss landscape that fluctuates with minibatch sampling, and derive a master equation and a discrete Fokker–Planck equation for the parameter distribution; the latter differs from the standard Langevin form at order eta^2. Decomposing the dynamics near critical points in the eigenbasis of the mean Hessian, they show that nearly flat directions admit no stationary distribution: the variance grows over time, which corresponds to diffusion along valleys with a coefficient proportional to the learning rate. Experiments on computer vision and natural language processing models support these predictions and show a clear qualitative separation between confined and diffusive modes.
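The scaling mismatch mentioned in the summary can be made concrete by writing the two updates side by side. This is only a sketch in standard textbook notation, not the paper's own derivation: the symbols B_k (minibatch at step k), xi_k (gradient noise), D (diffusion matrix) and epsilon_k (standard Gaussian vector) are assumptions introduced here for illustration.

```latex
% Discrete SGD update: the noise enters multiplied by \eta.
\theta_{k+1} = \theta_k - \eta \nabla L_{B_k}(\theta_k)
             = \theta_k - \eta \nabla L(\theta_k) - \eta\, \xi_k ,
\qquad \xi_k = \nabla L_{B_k}(\theta_k) - \nabla L(\theta_k), \quad \mathbb{E}[\xi_k] = 0 .

% Euler--Maruyama step of the Langevin equation
% d\theta = -\nabla L(\theta)\, dt + \sqrt{2D}\, dW_t  with  \Delta t = \eta:
% the noise enters multiplied by \sqrt{\eta}.
\theta_{k+1} = \theta_k - \eta \nabla L(\theta_k) + \sqrt{2 \eta D}\; \varepsilon_k ,
\qquad \varepsilon_k \sim \mathcal{N}(0, I) .
```

The per-step noise displacement is of order eta in the first line and of order sqrt(eta) in the second, so the two can only be matched by letting D itself depend on eta.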
📝 Abstract
Stochastic Gradient Descent (SGD) is commonly modeled as a Langevin process, assuming that minibatch noise acts as Brownian motion. However, this approximation relies on a continuous-time limit and a sqrt(eta) noise scaling that does not match the discrete SGD update at finite learning rate. In this work, we propose an alternative formulation of SGD as deterministic dynamics in a fluctuating loss landscape induced by minibatch sampling. Starting directly from the discrete update, we derive a master equation for the parameter distribution and obtain a discrete Fokker–Planck equation that differs from the standard Langevin form at order eta^2. Using this framework, we analyze SGD dynamics near critical points of the loss. We show that the behavior decomposes along the eigenbasis of the mean Hessian into qualitatively distinct regimes. In particular, nearly-flat directions do not admit a stationary distribution: the variance grows over time, corresponding to effective diffusion along valleys with a coefficient proportional to the learning rate. We provide empirical evidence supporting these predictions on neural network models in computer vision and natural language processing, observing a clear qualitative separation between confined and diffusive modes.
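The qualitative split between confined and diffusive modes can be illustrated with a toy simulation. The sketch below is not the authors' model or code: it assumes a single Hessian eigen-direction with curvature `h`, and it replaces minibatch noise with additive Gaussian gradient noise of fixed standard deviation `sigma`, which is a simplification of the fluctuating-landscape picture in the abstract. The function name `simulate_mode` and all parameter values are hypothetical.

```python
import numpy as np


def simulate_mode(h, eta, sigma, steps, walkers, seed=0):
    """Run the 1D update theta <- theta - eta * (h * theta + xi) for many
    independent walkers and return the variance across walkers per step.

    h     : curvature along one Hessian eigen-direction (0 means flat)
    eta   : learning rate
    sigma : std of the additive gradient noise xi (toy assumption)
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(walkers)
    variances = np.empty(steps)
    for k in range(steps):
        xi = rng.normal(0.0, sigma, size=walkers)  # zero-mean gradient noise
        theta = theta - eta * (h * theta + xi)     # discrete SGD-like step
        variances[k] = theta.var()
    return variances


eta, sigma, steps, walkers = 0.1, 1.0, 2000, 20000
h_confined = 1.0

# Flat direction: no restoring force, so the variance keeps growing.
flat_var = simulate_mode(0.0, eta, sigma, steps, walkers, seed=0)
# Curved direction: the restoring force balances the noise.
confined_var = simulate_mode(h_confined, eta, sigma, steps, walkers, seed=1)

# Closed-form values for this toy model only.
flat_pred = eta**2 * sigma**2 * steps
confined_pred = eta * sigma**2 / (h_confined * (2.0 - eta * h_confined))

print("flat:     simulated", flat_var[-1], "predicted", flat_pred)
print("confined: simulated", confined_var[-1], "predicted", confined_pred)
```

In this toy model the flat-direction variance after k steps is eta^2 * sigma^2 * k. Measured in the rescaled time tau = eta * k it equals eta * sigma^2 * tau, so the effective diffusion coefficient scales with the learning rate, in line with the abstract's claim. The curved direction instead settles at a finite variance.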
Problem

Research questions and friction points this paper is trying to address.

Stochastic Gradient Descent
Brownian Motion
Langevin Dynamics
Discrete Dynamics
Loss Landscape
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stochastic Gradient Descent
Discrete Fokker–Planck Equation
Fluctuating Loss Landscape
Diffusive Dynamics
Hessian Eigenbasis
Igor Ignashin
Basic Research of Artificial Intelligence Laboratory (BRAIn Lab)
Anna Radovskaya
Basic Research of Artificial Intelligence Laboratory (BRAIn Lab); P.N. Lebedev Physical Institute of the Russian Academy of Sciences
Andrew Semenov
Basic Research of Artificial Intelligence Laboratory (BRAIn Lab); P.N. Lebedev Physical Institute of the Russian Academy of Sciences
Egor Lopatin
Basic Research of Artificial Intelligence Laboratory (BRAIn Lab)
Stanislav Potapov
Basic Research of Artificial Intelligence Laboratory (BRAIn Lab)
Aleksandr Kovalenko
AIRI
Andrey Veprikov
Unknown affiliation
Optimization, ML, DL
Aleksandr Shestakov
Institute of Artificial Intelligence
Optimization, probability
Andrey Leonidov
Basic Research of Artificial Intelligence Laboratory (BRAIn Lab); P.N. Lebedev Physical Institute of the Russian Academy of Sciences
Aleksandr Beznosikov
PhD, Basic Research of Artificial Intelligence Lab
Optimization, Machine Learning