🤖 AI Summary
This work studies the optimization and generalization dynamics of stochastic gradient descent (SGD) on diagonal linear networks in the high-dimensional regime. The authors show that SGD trajectories are well approximated by a stochastic differential equation (SDE) that explicitly decouples the drift from the gradient noise, and they derive a deterministic partial differential equation whose solution characterizes the time evolution of observable statistics such as the risk and curvature. Under a suitable parametrization, the stochastic dynamics are shown to be globally well posed and to converge exponentially fast to zero risk with high probability, giving an explicit non-asymptotic description of the long-time behavior. Numerical simulations corroborate the theoretical findings.
📝 Abstract
Understanding the behavior of stochastic gradient methods is a central problem in modern machine learning. Recent work has highlighted diagonal linear networks as a simplified yet expressive setting for analyzing the optimization and generalization properties of neural models. In this work, we show that in the high-dimensional regime, stochastic gradient descent on diagonal linear networks is well-approximated by continuous dynamics governed by a stochastic differential equation (SDE), which explicitly decouples the drift from the gradient noise. We further derive a deterministic partial differential equation whose solution propagates the relevant state of the iterates and characterizes the time evolution of a broad class of observable statistics, including the risk, curvature, and other metrics for optimality. Finally, we show that, under a suitable parametrization, the stochastic dynamics are globally well posed and converge exponentially fast to zero risk with high probability, yielding a fully explicit non-asymptotic description of their long-time behavior. Numerical simulations corroborate our theoretical findings.
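To make the setting concrete, here is a minimal sketch of one-sample SGD on a toy diagonal linear network. Everything specific in it is an assumption made for illustration and is not taken from the paper: the elementwise-product parametrization `u * v`, the isotropic Gaussian inputs, the noiseless sparse teacher `w_star`, the initialization, and the `lr / d` step-size scaling. The function name `sgd_diagonal_linear_network` is likewise hypothetical.

```python
import numpy as np


def sgd_diagonal_linear_network(d=200, steps=20000, lr=0.5, n_active=5, seed=0):
    """Run one-sample SGD on a toy diagonal linear network.

    Assumptions (illustrative, not taken from the paper):
      * predictor f(x) = <u * v, x> (elementwise-product parametrization),
      * isotropic Gaussian inputs and noiseless labels y = <w_star, x>,
      * a sparse teacher w_star and a step size scaled as lr / d.
    Returns the population risk recorded before every update.
    """
    rng = np.random.default_rng(seed)
    w_star = np.zeros(d)
    w_star[:n_active] = 1.0          # hypothetical sparse teacher
    u = np.full(d, 0.1)              # small constant initialization (assumed)
    v = np.zeros(d)
    step = lr / d
    risks = np.empty(steps)
    for k in range(steps):
        # For isotropic inputs the population risk is 0.5 * ||u * v - w_star||^2.
        risks[k] = 0.5 * np.sum((u * v - w_star) ** 2)
        x = rng.standard_normal(d)           # fresh sample at every step
        residual = (u * v - w_star) @ x      # prediction error on this sample
        grad_u = residual * v * x            # gradient of 0.5 * residual^2 in u
        grad_v = residual * u * x            # gradient of 0.5 * residual^2 in v
        u = u - step * grad_u
        v = v - step * grad_v
    return risks


if __name__ == "__main__":
    risks = sgd_diagonal_linear_network()
    print(f"initial risk: {risks[0]:.3f}, final risk: {risks[-1]:.3e}")
```

The recorded risk curve is the kind of observable statistic the abstract refers to. In this toy setup, one could plot it for increasing `d` to see how the trajectory varies across runs and dimensions, though the sketch makes no claim to reproduce the paper's SDE or PDE predictions.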