🤖 AI Summary
This work investigates the **global convergence time** of stochastic gradient descent (SGD), i.e., the time the algorithm needs to first reach the global minimum of a general non-convex loss landscape. We develop an analytical framework grounded in the theory of randomly perturbed dynamical systems and Freidlin–Wentzell large-deviation theory, and we derive **matching exponential-order upper and lower bounds** on this convergence time, rigorously establishing that it is governed by the *most costly set of obstacles* (the highest effective energy barrier) that SGD may need to overcome on its way to a global minimizer. Our analysis couples the convergence time to both the statistics of the gradient noise (e.g., its covariance structure) and the geometric depth of these obstacles, exposing a “geometry–noise trade-off” in which landscape geometry and noise jointly determine the cost of escape. We further extend the theory to loss functions with shallow local minima. Empirical validation on canonical deep neural network loss surfaces shows strong agreement between theoretical predictions and observed dynamics, offering new insights into SGD’s implicit bias and generalization behavior.
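To give a sense of what such exponential-order estimates look like, here is a generic Freidlin–Wentzell-type scaling. The notation below is ours and purely illustrative; the paper's exact statement, constants, and noise model are not reproduced here.

```latex
% Illustrative sketch only: a generic Freidlin--Wentzell-type scaling in our own
% (hypothetical) notation, not the paper's exact result.
% SGD is viewed as a small-noise perturbation of gradient flow,
%   x_{k+1} = x_k - \eta \bigl( \nabla f(x_k) + \xi_k \bigr),
% and \tau_{\mathrm{global}} denotes the first time the iterates reach a
% neighborhood of the global minimizer.
\[
  \mathbb{E}\bigl[\tau_{\mathrm{global}}\bigr]
    \;=\; \exp\!\left( \frac{\bar H + o(1)}{\varepsilon} \right)
  \qquad \text{as } \varepsilon \to 0,
\]
% Here \varepsilon is the effective noise level (set by the step size and the
% noise covariance) and \bar H is the largest barrier, measured in
% large-deviation cost rather than raw loss difference, separating the
% initialization from the global minimizer.
```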
📝 Abstract
In this paper, we examine the time it takes for stochastic gradient descent (SGD) to reach the global minimum of a general, non-convex loss function. We approach this question through the lens of randomly perturbed dynamical systems and large deviations theory, and we provide a tight characterization of the global convergence time of SGD via matching upper and lower bounds. These bounds are dominated by the most "costly" set of obstacles that the algorithm may need to overcome to reach a global minimizer from a given initialization, coupling in this way the global geometry of the underlying loss landscape with the statistics of the noise entering the process. Finally, motivated by applications to the training of deep neural networks, we also provide a series of refinements and extensions of our analysis for loss functions with shallow local minima.
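To make the modeling viewpoint concrete, here is a minimal, self-contained numerical sketch (ours, not the authors' code): SGD on a one-dimensional double-well loss, treated as a noisy perturbation of gradient flow, where we record the first iteration at which the iterate crosses from the shallow basin into the basin of the global minimizer. The loss, step size, and noise levels are arbitrary choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(x, tilt=0.3):
    """Gradient of the tilted double-well loss f(x) = (x^2 - 1)^2 / 4 + tilt * x.
    The global minimum sits near x = -1.1; a shallower local minimum near x = 0.8."""
    return x * (x**2 - 1.0) + tilt

def first_hitting_time(eta, sigma, x0=1.0, threshold=-0.9, max_iters=1_000_000):
    """Iterations until SGD (gradient step plus Gaussian noise) first reaches the
    neighborhood of the global minimizer, starting inside the shallow basin."""
    x = x0
    for t in range(1, max_iters + 1):
        x -= eta * (grad(x) + sigma * rng.standard_normal())
        if x < threshold:      # crossed the barrier and descended into the deep basin
            return t
    return max_iters           # censored: no crossing within the iteration budget

if __name__ == "__main__":
    eta = 0.02
    for sigma in (2.0, 1.5, 1.2, 1.0):
        times = [first_hitting_time(eta, sigma) for _ in range(10)]
        print(f"noise sigma = {sigma:3.1f}  ->  mean first-hitting time ~ {np.mean(times):8.0f} iterations")
```

Shrinking the noise level (or the step size) makes the measured hitting times blow up sharply, which is the kind of exponential-order dependence on the dominant barrier that the bounds above formalize.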