Convergence Rates for Gradient Descent on the Edge of Stability in Overparametrised Least Squares

📅 2025-10-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper investigates the convergence behavior of gradient descent for overparameterized least-squares problems at the “edge of stability”—i.e., under large learning rates where classical small-step analyses break down. The objective is non-monotonic and implicitly favors flat minima, posing challenges for conventional convergence theory. To address this, the authors model the global minimizer set as a Riemannian manifold and perform a dynamical decomposition using bifurcation theory from dynamical systems. They rigorously characterize exact convergence rates across three learning-rate regimes: subcritical (linear convergence to suboptimal flat solutions), critical (power-law convergence to optimal solutions), and supercritical (linear convergence to a period-2 orbit). This is the first work to establish such precise regime-dependent convergence guarantees. The analysis reveals an intrinsic link between transient instability and the selection of optimally flat minima, thereby providing a theoretical foundation for understanding the implicit regularization induced by large-step optimization.

📝 Abstract
Classical optimisation theory guarantees monotonic objective decrease for gradient descent (GD) when employed in a small step size, or "stable", regime. In contrast, gradient descent on neural networks is frequently performed in a large step size regime called the "edge of stability", in which the objective decreases non-monotonically with an observed implicit bias towards flat minima. In this paper, we take a step toward quantifying this phenomenon by providing convergence rates for gradient descent with large learning rates in an overparametrised least squares setting. The key insight behind our analysis is that, as a consequence of overparametrisation, the set of global minimisers forms a Riemannian manifold $M$, which enables the decomposition of the GD dynamics into components parallel and orthogonal to $M$. The parallel component corresponds to Riemannian gradient descent on the objective sharpness, while the orthogonal component is a bifurcating dynamical system. This insight allows us to derive convergence rates in three regimes characterised by the learning rate size: (a) the subcritical regime, in which transient instability is overcome in finite time before linear convergence to a suboptimally flat global minimum; (b) the critical regime, in which instability persists for all time with a power-law convergence toward the optimally flat global minimum; and (c) the supercritical regime, in which instability persists for all time with linear convergence to an orbit of period two centred on the optimally flat global minimum.
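The regimes described in the abstract can be seen in a minimal sketch (not taken from the paper) of an overparametrised least-squares toy problem: fitting $ab = 1$ with the loss $f(a,b) = \tfrac{1}{2}(ab - 1)^2$. The minimiser set is the hyperbola $ab = 1$, along which the sharpness $a^2 + b^2$ varies, so a large step size forces GD to drift toward a flatter minimiser before it can converge. The specific initialisation and learning rates below are illustrative choices, not values from the paper.

```python
# Toy overparametrised least squares: f(a, b) = (ab - 1)^2 / 2.
# The global minimisers form the hyperbola ab = 1 (a 1-D manifold),
# and the sharpness (largest Hessian eigenvalue) at a minimiser is
# a^2 + b^2, which is smallest at the "flattest" point a = b = 1.

def loss(a, b):
    return 0.5 * (a * b - 1.0) ** 2

def gd(a, b, lr, steps):
    traj = [loss(a, b)]
    for _ in range(steps):
        r = a * b - 1.0  # residual
        a, b = a - lr * r * b, b - lr * r * a  # simultaneous GD update
        traj.append(loss(a, b))
    return a, b, traj

# Stable regime: small step size, monotone decrease, but GD settles at a
# sharp minimiser close to the unbalanced initialisation (a, b) = (3, 0.4).
a_s, b_s, traj_s = gd(3.0, 0.4, lr=0.01, steps=5000)

# Edge of stability: large step size, the loss spikes non-monotonically
# while GD drifts along the minimiser manifold to a flatter point whose
# sharpness satisfies the stability bound a^2 + b^2 <= 2 / lr.
a_e, b_e, traj_e = gd(3.0, 0.4, lr=0.3, steps=5000)

print(f"stable sharpness: {a_s**2 + b_s**2:.2f}")  # stays near its initial value
print(f"EoS sharpness:    {a_e**2 + b_e**2:.2f}")  # drops below 2 / lr ~ 6.67
```

With the larger step size the final sharpness is suboptimal but bounded by $2/\eta$, matching the subcritical picture in the abstract; pushing the learning rate higher would move the toy into the critical and supercritical regimes.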
Problem

Research questions and friction points this paper is trying to address.

Analyzing gradient descent convergence with large learning rates
Quantifying non-monotonic optimization in overparametrized least squares
Characterizing three convergence regimes at the edge of stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes gradient descent with large learning rates
Decomposes dynamics using Riemannian manifold geometry
Derives convergence rates for three learning-rate regimes
Lachlan Ewen MacDonald
Innovation in Data Engineering and Science (IDEAS), University of Pennsylvania, Philadelphia, PA 19104
Hancheng Min
Shanghai Jiao Tong University
Deep Learning Theory · Dynamical Systems and Control · Networked Systems
Leandro Palma
Innovation in Data Engineering and Science (IDEAS), University of Pennsylvania, Philadelphia, PA 19104
Salma Tarmoun
PhD student, University of Pennsylvania
Machine Learning · Optimization · Control Theory · Artificial Intelligence
Ziqing Xu
University of Pennsylvania
Deep Learning Theory · Optimization · Theory of Machine Learning
René Vidal
Innovation in Data Engineering and Science (IDEAS), University of Pennsylvania, Philadelphia, PA 19104