🤖 AI Summary
The theoretical analysis of adaptive stochastic gradient descent (SGD) under small learning rates remains challenging due to the intricate coupling between parameter updates and adaptive second-moment estimates.
Method: We establish a rigorous continuous-time limit dynamics model via stochastic differential equations (SDEs), employing diffusion approximation, multiscale analysis, and the stochastic modified equations framework of Li et al.
Contribution/Results: We prove that, as the learning rate vanishes, the sampling noise in the joint evolution of parameters and second-moment estimates asymptotically behaves as the superposition of two independent Brownian motions. We derive precise scaling laws linking the learning rate to key hyperparameters, yielding a unified SDE limit that encompasses mainstream adaptive optimizers including Adam and RMSProp. This work gives a theoretical characterization of the intrinsic noise structure and coupled bivariate dynamics of adaptive optimizers, delivering a tight, analytically tractable continuous approximation to their discrete-time behavior.
📝 Abstract
We present a theoretical analysis of some popular adaptive Stochastic Gradient Descent (SGD) methods in the small learning rate regime. Using the stochastic modified equations framework introduced by Li et al., we derive effective continuous stochastic dynamics for these methods. Our key contribution is to show that sampling-induced noise in SGD manifests in the limit as independent Brownian motions driving the evolutions of the parameters and of the gradient second-moment estimates. Furthermore, extending the approach of Malladi et al., we investigate scaling rules between the learning rate and key hyperparameters in adaptive methods, characterising all non-trivial limiting dynamics.
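To make the scaling-rule setting concrete, here is a minimal toy sketch of the discrete dynamics being approximated: a one-dimensional RMSProp-style update in which the second-moment decay rate is coupled to the learning rate as β = 1 − c·η. This particular coupling, the quadratic objective, and the noise level are illustrative assumptions for exposition, not the paper's exact construction.

```python
import math
import random

def rmsprop_toy(eta=1e-3, c=1.0, eps=1e-8, n_steps=1000, seed=0):
    """Toy 1-D RMSProp run in the small-learning-rate regime.

    Illustrative assumption: the second-moment decay rate is tied to
    the learning rate via beta = 1 - c * eta, so both the parameter x
    and the second-moment estimate v evolve on eta-dependent timescales,
    which is the coupled bivariate system the SDE limit describes.
    """
    rng = random.Random(seed)
    beta = 1.0 - c * eta              # hyperparameter coupled to eta
    x, v = 1.0, 0.0                   # parameter and second-moment estimate
    for _ in range(n_steps):
        g = x + rng.gauss(0.0, 0.5)   # noisy gradient of f(x) = x^2 / 2
        v = beta * v + (1.0 - beta) * g * g  # second-moment update
        x -= eta * g / (math.sqrt(v) + eps)  # adaptive parameter update
    return x, v

x, v = rmsprop_toy()
```

As η → 0 (with c fixed), both the parameter and second-moment updates shrink together, which is the regime in which the sampling noise in (x, v) is claimed to converge to two independent Brownian motions.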