On the global convergence of gradient descent for wide shallow models with bounded nonlinearities

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work studies why gradient descent on wide, shallow neural networks reaches global minimizers of a non-convex training loss. It considers a broad class of models with bounded nonlinearities, including multi-head attention layers and two-layer sigmoid networks with vector output weights, and analyzes continuous-time gradient descent in the mean-field limit of many hidden neurons or attention heads. Building on Chizat and Bach (2018), the authors prove that all non-global minimizers are unstable under the training dynamics by constructing an "escaping active set": they complete the earlier construction for bounded nonlinearities with scalar output weights and extend it to new cases with vector output weights. Consequently, when the initial parameter distribution has full support (for example, Gaussian), the dynamics can only converge to global minimizers. The paper also shows that the mean-field training dynamic is well-posed and stable with respect to discretization for sub-Gaussian initializations.

📝 Abstract

A surprising phenomenon in the training of neural networks is the ability of gradient descent to find global minimizers of the training loss despite its non-convexity. Following earlier works, we investigate this behavior for wide shallow networks. Existing results essentially cover the case of ReLU activations and the case of sigmoid activations with scalar output weights. We study a large class of models that includes multi-head attention layers and two-layer sigmoid networks with vector output weights. Building upon [Chizat and Bach, 2018], we prove that all non-global minimizers of the training loss are unstable under gradient descent dynamics. Thus, when the initial distribution of the parameters has full support (which includes the popular Gaussian case), and in the limit of many hidden neurons or attention heads, continuous-time gradient descent can only converge to global minimizers. Establishing the instability of non-global minimizers corresponds to the construction of an "escaping active set": we complete the proof of [Chizat and Bach, 2018] to construct this set for models with bounded nonlinearities and scalar output weights. We also extend this construction to new cases for models with vector output weights. Finally, we show the well-posedness and the stability with respect to discretization of the mean field training dynamic for sub-Gaussian initializations.
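
To make the setting concrete, here is a minimal NumPy sketch of the kind of model the abstract describes: a two-layer sigmoid network with vector output weights, Gaussian (full-support) initialization, and a discretized gradient descent in mean-field scaling. It is an illustration under stated assumptions, not the paper's code. The squared loss, the step size, the `1/m` output averaging, and every function name (`init_params`, `predict`, `loss_and_grads`, `train`) are choices made for this example only.

```python
import numpy as np


def sigmoid(z):
    # Numerically stable logistic function; a bounded nonlinearity in (0, 1).
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def init_params(m, d_in, d_out, rng):
    # Gaussian initialization: a full-support distribution over each neuron's
    # parameters (input weights W[k], vector output weights A[k]).
    W = rng.standard_normal((m, d_in))
    A = rng.standard_normal((m, d_out))
    return W, A


def predict(W, A, X):
    # Mean-field scaling (assumed here): the output is an average over the
    # m hidden neurons, f(x) = (1/m) * sum_k A[k] * sigmoid(<W[k], x>).
    m = W.shape[0]
    return sigmoid(X @ W.T) @ A / m


def loss_and_grads(W, A, X, Y):
    # Squared loss (an assumption for this sketch) and its exact gradients.
    n, m = X.shape[0], W.shape[0]
    H = sigmoid(X @ W.T)                 # hidden activations, shape (n, m)
    R = H @ A / m - Y                    # residuals, shape (n, d_out)
    loss = 0.5 * np.sum(R ** 2) / n
    gA = H.T @ R / (n * m)               # dL/dA, shape (m, d_out)
    gH = R @ A.T / (n * m)               # dL/dH, shape (n, m)
    gW = (gH * H * (1.0 - H)).T @ X      # dL/dW via sigmoid' = H * (1 - H)
    return loss, gW, gA


def train(X, Y, m=200, lr=0.1, steps=200, seed=0):
    # Explicit Euler discretization of the gradient flow. Gradients are
    # multiplied by m so that each neuron moves at a rate independent of
    # the width, which is the usual mean-field time scaling.
    rng = np.random.default_rng(seed)
    W, A = init_params(m, X.shape[1], Y.shape[1], rng)
    losses = []
    for _ in range(steps):
        loss, gW, gA = loss_and_grads(W, A, X, Y)
        losses.append(loss)
        W -= lr * m * gW
        A -= lr * m * gA
    losses.append(loss_and_grads(W, A, X, Y)[0])
    return np.array(losses)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((20, 3))
    Y = np.stack([np.sin(X[:, 0]), X[:, 1] * X[:, 2]], axis=1)
    losses = train(X, Y)
    print("initial loss:", losses[0], "final loss:", losses[-1])
```

A finite-width, finite-step run like this only approximates the continuous-time, infinite-width dynamics the result concerns, so it does not by itself demonstrate convergence to a global minimizer.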

Problem

Research questions and friction points this paper is trying to address.

global convergence

gradient descent

wide shallow networks

bounded nonlinearities

non-convex optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

global convergence

gradient descent

wide shallow networks