🤖 AI Summary
This work investigates the asymptotic learning dynamics of multi-pass mini-batch stochastic gradient descent (SGD) in high-dimensional multi-index models, in the proportional regime where the sample size and the data dimension grow together at a fixed ratio. For isotropic random data, a scalar Poisson jump process is introduced to characterize the limiting behavior of the SGD sampling noise, enabling an exact coordinate-wise description of the dynamics. The study derives, for the first time, asymptotically exact mean-field equations for multi-pass SGD with arbitrary sub-linear batch sizes, clarifying the connections and distinctions among SGD, the stochastic modified equation (SME), and gradient flow. Theoretically, it is established that under a commensurate scaling of the learning rate, SGD trajectories with different batch-size scalings converge to the same limiting dynamics; that SGD and SME are equivalent in linear models; and that the known limiting characterizations of both gradient flow and one-pass/online SGD are recovered within this unified framework.
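As a concrete illustration of the procedure being analyzed, here is a minimal NumPy sketch of multi-pass mini-batch SGD on a single-index special case ($k = 1$) with isotropic Gaussian data. The link function, noise level, sampling-with-replacement scheme, and the learning-rate scaling $\eta \propto \kappa/d$ are illustrative assumptions, not the paper's exact conventions.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 500                      # data dimension
n = 2 * d                    # proportional regime: n/d held fixed
alpha = 0.5                  # batch-size exponent, kappa ~ n^alpha
kappa = int(n ** alpha)      # sub-linear batch size
eta = kappa / d              # learning-rate scaling (assumed form, for illustration)

sigma = np.tanh              # link function (an assumption)
dsigma = lambda z: 1.0 - np.tanh(z) ** 2

w_star = rng.standard_normal(d)               # planted index direction
X = rng.standard_normal((n, d))               # isotropic random data
y = sigma(X @ w_star / np.sqrt(d)) + 0.1 * rng.standard_normal(n)

w = rng.standard_normal(d)                    # initialization
for t in range(10_000):
    # multi-pass: a fresh batch is drawn each step, so samples are revisited
    batch = rng.integers(0, n, size=kappa)
    z = X[batch] @ w / np.sqrt(d)             # pre-activations on the batch
    resid = sigma(z) - y[batch]               # residuals for squared loss
    grad = X[batch].T @ (resid * dsigma(z)) / (kappa * np.sqrt(d))
    w -= eta * grad                           # mini-batch SGD step
```

Along such a trajectory, one would track coordinate-wise summary statistics (e.g. the overlap $\langle w, w^\star \rangle / d$); the mean-field equations described above characterize the limits of such quantities as $n, d \to \infty$ proportionally.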
📝 Abstract
We study the learning dynamics of a multi-pass, mini-batch Stochastic Gradient Descent (SGD) procedure for empirical risk minimization in high-dimensional multi-index models with isotropic random data. In an asymptotic regime where the sample size $n$ and data dimension $d$ increase proportionally, for any sub-linear batch size $\kappa \asymp n^\alpha$ where $\alpha \in [0,1)$, and for a commensurate ``critical'' scaling of the learning rate, we provide an asymptotically exact characterization of the coordinate-wise dynamics of SGD. This characterization takes the form of a system of dynamical mean-field equations, driven by a scalar Poisson jump process that represents the asymptotic limit of the SGD sampling noise. We develop an analogous characterization of the Stochastic Modified Equation (SME), which provides a Gaussian diffusion approximation to SGD. Our analyses imply that the limiting dynamics of SGD are the same for any batch-size scaling $\alpha \in [0,1)$, and that under a commensurate scaling of the learning rate, the dynamics of SGD, SME, and gradient flow are mutually distinct, with those of SGD and SME coinciding in the special case of a linear model. We recover a known dynamical mean-field characterization of gradient flow in a limit of small learning rate, and of one-pass/online SGD in a limit of increasing sample size $n/d \to \infty$.
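For reference, a standard way to write the objects named in the abstract is sketched below; the paper's precise normalizations, loss, and batch-sampling scheme may differ.

```latex
% Multi-index model with isotropic data (a common convention; an assumption here):
\[
  y_i = f\!\Big(\tfrac{\langle w_1^\star, x_i\rangle}{\sqrt{d}}, \dots,
                \tfrac{\langle w_k^\star, x_i\rangle}{\sqrt{d}}\Big) + \varepsilon_i,
  \qquad x_i \sim \mathcal{N}(0, I_d), \quad i = 1, \dots, n .
\]
% Multi-pass mini-batch SGD: at each step a fresh batch B_t \subseteq \{1,\dots,n\}
% is drawn, so training samples are revisited across steps.
\[
  \theta^{t+1} = \theta^t - \frac{\eta}{\kappa} \sum_{i \in B_t}
                 \nabla_\theta\, \ell\big(\theta^t; x_i, y_i\big),
  \qquad |B_t| = \kappa \asymp n^{\alpha}, \quad \alpha \in [0,1).
\]
```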