Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the dynamical separation between generalization and overfitting during the training of large two-layer neural networks. Methodologically, it develops a high-dimensional asymptotic characterization of the training dynamics using dynamical mean-field theory from non-equilibrium statistical physics, applied to a Gaussian approximation of the hidden-neuron nonlinearity. The analysis reveals a temporal decoupling between feature learning and overfitting: the former dominates the early, fast dynamics, while the latter emerges on a slower timescale associated with growth in Gaussian/Rademacher complexity. Key results include a non-monotone evolution of the test error, a late-time "feature unlearning" phase, an algorithmic inductive bias toward small complexity (provided the initialization itself has small enough complexity), and a dynamical explanation for the efficacy of early stopping. Experiments indicate the theory captures the behavior of actual neural network models well.

📝 Abstract
The inductive bias and generalization properties of large machine learning models are -- to a substantial extent -- a byproduct of the optimization algorithm used for training. Among others, the scale of the random initialization, the learning rate, and early stopping all have crucial impact on the quality of the model learnt by stochastic gradient descent or related algorithms. In order to understand these phenomena, we study the training dynamics of large two-layer neural networks. We use a well-established technique from non-equilibrium statistical physics (dynamical mean field theory) to obtain an asymptotic high-dimensional characterization of this dynamics. This characterization applies to a Gaussian approximation of the hidden neurons non-linearity, and empirically captures well the behavior of actual neural network models. Our analysis uncovers several interesting new phenomena in the training dynamics: $(i)$ The emergence of a slow time scale associated with the growth in Gaussian/Rademacher complexity; $(ii)$ As a consequence, algorithmic inductive bias towards small complexity, but only if the initialization has small enough complexity; $(iii)$ A separation of time scales between feature learning and overfitting; $(iv)$ A non-monotone behavior of the test error and, correspondingly, a 'feature unlearning' phase at large times.
Problem

Research questions and friction points this paper is trying to address.

Understanding training dynamics in large two-layer neural networks.
Analyzing impact of initialization and learning rate on model quality.
Exploring separation of feature learning and overfitting time scales.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamical mean field theory for neural networks
Gaussian approximation of hidden neurons
Separation of feature learning and overfitting
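The training setting studied by the paper can be illustrated with a minimal numerical sketch, assuming a toy configuration: a two-layer tanh network trained by full-batch gradient descent on noisy data from a single-index teacher, starting from a small-scale (low-complexity) initialization, while recording train and test error over time. This is not the paper's DMFT analysis; all dimensions, hyperparameters, and the teacher function are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: input dim d, hidden width m, sample counts
d, m, n_train, n_test = 20, 100, 60, 1000

# Hypothetical single-index teacher with label noise
w_star = rng.standard_normal(d) / np.sqrt(d)
X_tr = rng.standard_normal((n_train, d))
y_tr = np.tanh(X_tr @ w_star) + 0.5 * rng.standard_normal(n_train)
X_te = rng.standard_normal((n_test, d))
y_te = np.tanh(X_te @ w_star)  # noiseless test labels

# Small-scale random initialization (a low-complexity starting point)
scale = 0.1
W = scale * rng.standard_normal((m, d)) / np.sqrt(d)
a = scale * rng.standard_normal(m) / np.sqrt(m)

def forward(X, W, a):
    """Two-layer network: x -> a . tanh(W x)."""
    return np.tanh(X @ W.T) @ a

lr, steps = 0.05, 3000
train_err, test_err = [], []
for t in range(steps):
    H = np.tanh(X_tr @ W.T)            # hidden activations, n_train x m
    resid = H @ a - y_tr               # prediction residuals
    # Full-batch gradient descent on the mean squared error
    grad_a = H.T @ resid / n_train
    grad_W = ((resid[:, None] * a[None, :] * (1.0 - H**2)).T @ X_tr) / n_train
    a -= lr * grad_a
    W -= lr * grad_W
    if t % 100 == 0:
        train_err.append(np.mean(resid**2))
        test_err.append(np.mean((forward(X_te, W, a) - y_te)**2))
```

Plotting `train_err` and `test_err` against time in such a setup is the kind of experiment where one would look for the phenomena the paper describes: fast early feature learning, slower overfitting, and a test-error curve that need not be monotone, which is what makes early stopping effective.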