Uniform-in-Time Weak Propagation-of-Chaos in Shallow Neural Networks

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work bounds the output discrepancy between a finite-width shallow neural network under long-time training and its infinite-width mean-field limit. By exploiting the convergence rate of the mean-field Wasserstein gradient flow, the authors establish a non-asymptotic weak propagation-of-chaos bound that holds uniformly in time, without requiring noise injection, local strong convexity of the loss, or logarithmic Sobolev inequalities. The result extends to both sample and time discretization. If the mean-field excess loss decays as $L_t \lesssim t^{-c}$, the squared output discrepancy with $m$ neurons is bounded by $\mathrm{poly}(d)\cdot m^{-\min(1,c/6)}$. When the decay is faster than $t^{-2}$, achieving $\varepsilon$-accuracy requires only $\mathrm{poly}(d/\varepsilon)$ neurons, samples, and gradient steps.
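To make the width requirement concrete, the bound $\mathrm{poly}(d)\cdot m^{-\min(1,c/6)}$ can be inverted for $m$. The following is a minimal sketch, not taken from the paper: the function name `neurons_needed` and the argument `prefactor` are hypothetical, and `prefactor` stands in for the unspecified $\mathrm{poly}(d)$ factor and hidden constants.

```python
import math


def neurons_needed(eps, c, prefactor=1.0):
    """Smallest integer m with prefactor * m**(-min(1, c/6)) <= eps.

    `prefactor` is a hypothetical stand-in for the poly(d) factor and
    the constants hidden in the paper's bound; it is an assumption.
    """
    if eps <= 0 or c <= 0 or prefactor <= 0:
        raise ValueError("eps, c and prefactor must be positive")
    exponent = min(1.0, c / 6.0)  # decay exponent of the bound in m
    # Solve prefactor * m**(-exponent) <= eps for m.
    return max(1, math.ceil((prefactor / eps) ** (1.0 / exponent)))


if __name__ == "__main__":
    # Slower mean-field convergence (smaller c) demands more neurons.
    for c in (3, 6, 12):
        print(f"c={c}: m >= {neurons_needed(0.01, c)}")
```

Under these assumptions the required width grows like $(\mathrm{poly}(d)/\varepsilon)^{\max(1,6/c)}$, which stays polynomial in $d/\varepsilon$ for any fixed $c$; rates beyond $c = 6$ bring no further improvement in the exponent.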

📝 Abstract

We consider one-hidden-layer neural networks trained in the feature-learning regime using gradient descent, and relate the output of the finite-width network $f_{\hat{\rho}_t^m}$ to its infinite-width counterpart $f_{\rho_t^{MF}}$, which evolves according to the mean-field dynamics. While constant-time-horizon bounds for $\|f_{\rho_t^{MF}} - f_{\hat{\rho}_t^m}\|$ may be obtained via standard Grönwall estimates, the long-time behavior of the fluctuation is a more delicate matter. Uniform-in-time bounds often rely on (local) strong convexity of the landscape or on the logarithmic Sobolev inequalities present in noisy gradient dynamics. In this work, we establish non-asymptotic weak propagation-of-chaos that holds uniformly in time, obtained by exploiting instead the convergence rate of the deterministic mean-field Wasserstein gradient flow dynamics. Specifically, denoting by $L_t$ the mean-field excess MSE loss at time $t$ and by $m$ the number of neurons, under standard regularity assumptions and the condition $\int_0^\infty L_t^{1/2}\,dt = O(\log d)$, we obtain the uniform-in-time bound $\|f_{\rho_t^{MF}} - f_{\hat{\rho}_t^m}\|^2 \lesssim \mathrm{poly}(d)\, m^{-\min(1,c/6)}$ whenever $L_t \lesssim t^{-c}$. Our result holds in a noiseless setting, makes no assumptions on the geometry of the landscape near the optimum, and extends seamlessly to other forms of discretization, including a finite number of samples and time discretization. A key takeaway of our result is that whenever the convergence rate of the mean-field, population-loss dynamics is faster than $t^{-2}$, we can attain a loss of $\varepsilon$ with only $\mathrm{poly}(d/\varepsilon)$ neurons, training samples, and GD steps.
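One plausible reading of the $t^{-2}$ threshold, offered here as a sketch rather than as the paper's own argument, is that it matches the integrability condition on $L_t^{1/2}$. If $L_t \lesssim t^{-c}$, then $L_t^{1/2} \lesssim t^{-c/2}$, and the tail of the integral satisfies

$$
\int_1^\infty L_t^{1/2}\,dt \;\lesssim\; \int_1^\infty t^{-c/2}\,dt \;=\; \frac{1}{c/2 - 1} \;=\; \frac{2}{c-2} \qquad \text{for } c > 2,
$$

while the right-hand integral diverges for $c \le 2$. Whether the resulting constant is $O(\log d)$, as the stated condition requires, depends on the constants hidden in $\lesssim$, which this calculation does not track.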

Problem

Research questions and friction points this paper is trying to address.

propagation-of-chaos

mean-field limit

shallow neural networks

uniform-in-time bound

gradient descent

Innovation

Methods, ideas, or system contributions that make the work stand out.

propagation-of-chaos

mean-field limit

uniform-in-time convergence