Gradient Descent Converges Linearly to Flatter Minima than Gradient Flow in Shallow Linear Networks

📅 2025-01-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the optimization dynamics of gradient descent (GD) versus gradient flow (GF) in shallow linear neural networks. Methodologically, it employs continuous-time approximation, sharpness analysis, and matrix differential equation modeling. The paper provides the first rigorous proof that GD converges to the global minimum at an explicit linear rate even when the step size approaches $2/\text{sharpness}$. Theoretically and empirically, GD solutions exhibit smaller parameter norms and lower sharpness than GF solutions—indicating stronger implicit regularization. Crucially, a fundamental trade-off emerges: increasing the step size accelerates convergence but weakens implicit regularization. This result offers the first formal theoretical explanation for the “edge of stability” training phenomenon and reveals the pivotal role of discretization step size in shaping implicit bias.

📝 Abstract
We study the gradient descent (GD) dynamics of a depth-2 linear neural network with a single input and output. We show that GD converges at an explicit linear rate to a global minimum of the training loss, even with a large stepsize -- about $2/\textrm{sharpness}$. It still converges for even larger stepsizes, but may do so very slowly. We also characterize the solution to which GD converges, which has lower norm and sharpness than the gradient flow solution. Our analysis reveals a trade-off between the speed of convergence and the magnitude of implicit regularization. This sheds light on the benefits of training at the “Edge of Stability”, which induces additional regularization by delaying convergence and may have implications for training more complex models.
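The abstract's setting can be reproduced in a few lines: a depth-2 linear network with a single input and output reduces to the scalar loss $L(a,b) = \tfrac{1}{2}(ab - y)^2$. The sketch below (an illustrative toy, not the paper's experiments; the step sizes and initialization are chosen by hand) compares GD with a tiny stepsize, which tracks gradient flow, against GD with a large stepsize, and reports the parameter norm and sharpness (largest Hessian eigenvalue) of each solution.

```python
import numpy as np

def train(a, b, eta, steps, y=1.0):
    """Run gradient descent on L(a, b) = (a*b - y)^2 / 2."""
    for _ in range(steps):
        r = a * b - y                      # residual
        a, b = a - eta * r * b, b - eta * r * a
    return a, b

def sharpness(a, b, y=1.0):
    """Largest eigenvalue of the Hessian of L at (a, b)."""
    H = np.array([[b * b, 2 * a * b - y],
                  [2 * a * b - y, a * a]])
    return np.linalg.eigvalsh(H)[-1]       # eigvalsh returns ascending order

a0, b0 = 2.0, 0.25                         # deliberately unbalanced init
# Tiny stepsize: discretization error is negligible, so this tracks GF.
a_gf, b_gf = train(a0, b0, eta=1e-3, steps=200_000)
# Large stepsize (hand-picked for this toy problem).
a_gd, b_gd = train(a0, b0, eta=0.6, steps=2_000)

print("GF-like:      norm^2 = %.3f  sharpness = %.3f"
      % (a_gf**2 + b_gf**2, sharpness(a_gf, b_gf)))
print("large-step GD: norm^2 = %.3f  sharpness = %.3f"
      % (a_gd**2 + b_gd**2, sharpness(a_gd, b_gd)))
```

Both runs reach a global minimum ($ab \approx 1$), but the large-stepsize run should settle at a solution with smaller norm and lower sharpness, illustrating the implicit-regularization gap the paper quantifies.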
Problem

Research questions and friction points this paper is trying to address.

Gradient Descent
Neural Networks
Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient Descent
Neural Networks
Slow Training
Pierfrancesco Beneventano
Princeton University, Princeton, NJ, USA
Blake Woodworth
Google
Optimization · Machine Learning