🤖 AI Summary
This work investigates the optimization dynamics of gradient descent (GD) versus gradient flow (GF) in shallow linear neural networks. Methodologically, it employs continuous-time approximation, sharpness analysis, and matrix differential equation modeling. The paper gives a rigorous proof that GD converges to a global minimum at an explicit linear rate even when the step size is about $2/\text{sharpness}$; GD still converges for larger step sizes, though potentially very slowly. Theoretically and empirically, GD solutions exhibit smaller parameter norms and lower sharpness than GF solutions, indicating stronger implicit regularization. Crucially, a fundamental trade-off emerges: increasing the step size strengthens implicit regularization but delays convergence. This result offers a formal theoretical account of the "edge of stability" training phenomenon and reveals the pivotal role of discretization step size in shaping implicit bias.
📝 Abstract
We study the gradient descent (GD) dynamics of a depth-2 linear neural network with a single input and output. We show that GD converges at an explicit linear rate to a global minimum of the training loss, even with a large stepsize -- about $2/\textrm{sharpness}$. It still converges for even larger stepsizes, but may do so very slowly. We also characterize the solution to which GD converges, which has lower norm and sharpness than the gradient flow solution. Our analysis reveals a trade-off between the speed of convergence and the magnitude of implicit regularization. This sheds light on the benefits of training at the ``Edge of Stability'', which induces additional regularization by delaying convergence and may have implications for training more complex models.
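The setting above can be illustrated with a toy simulation (not the paper's code; the initialization, target, and stepsizes below are illustrative assumptions). A depth-2 linear network with a single input and output computes $x \mapsto abx$, so fitting one sample $(x, y) = (1, 1)$ gives the loss $L(a,b) = \tfrac{1}{2}(ab - 1)^2$. At a global minimum ($ab = 1$), the sharpness (largest Hessian eigenvalue) equals $a^2 + b^2$. Running GD with a tiny stepsize approximates gradient flow and preserves the imbalance $a^2 - b^2$ of the initialization, while a large stepsize near $2/\textrm{sharpness}$ oscillates at first and settles at a lower-norm, lower-sharpness minimum:

```python
# Toy sketch of GD on a depth-2 scalar linear network f(x) = a*b*x,
# fit to the single sample (x, y) = (1, 1): L(a, b) = 0.5 * (a*b - 1)^2.
# At a global minimum (a*b = 1) the sharpness equals a^2 + b^2.
# Initialization and stepsizes are illustrative assumptions, not the paper's.

def gd(a, b, lr, steps=20000, target=1.0):
    """Run plain gradient descent on L(a, b) = 0.5 * (a*b - target)^2."""
    for _ in range(steps):
        r = a * b - target                    # residual
        a, b = a - lr * b * r, b - lr * a * r  # simultaneous GD update
    return a, b

a0, b0 = 3.0, 0.1  # unbalanced initialization: a0^2 - b0^2 = 8.99

# Tiny stepsize ~ gradient flow: the imbalance a^2 - b^2 is nearly conserved,
# so the solution inherits a large norm (and hence large sharpness).
a_gf, b_gf = gd(a0, b0, lr=1e-3)

# Large stepsize, near 2/sharpness of the GF solution: early oscillations
# shrink the imbalance, and GD settles at a lower-norm, flatter minimum.
a_gd, b_gd = gd(a0, b0, lr=0.2)

for tag, a, b in [("small lr", a_gf, b_gf), ("large lr", a_gd, b_gd)]:
    print(f"{tag}: a*b = {a * b:.6f}, sharpness a^2+b^2 = {a * a + b * b:.3f}")
```

Both runs reach the global minimum $ab = 1$, but the large-stepsize run ends at a point with smaller $a^2 + b^2$, matching the trade-off described in the abstract: more implicit regularization at the cost of a slower, oscillatory approach.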