Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer

📅 2026-05-08

🤖 AI Summary

This work investigates the spectral dynamics of hidden weights in wide neural networks trained via (stochastic) gradient descent, focusing on the interplay between outlier eigenvalues and the bulk spectrum and on its implications for learning rate transfer and feature learning. By developing a two-level dynamical mean-field theory (DMFT), the study gives a unified characterization of spectral evolution in infinite-width nonlinear networks and in deep linear networks in the proportional high-dimensional limit, accounting for statistical dependencies between spike directions and the bulk. The analysis shows that, in deep linear networks, μP parameterization yields width-consistent outlier dynamics and hyperparameter transfer, whereas NTK parameterization exhibits strongly width-dependent outlier dynamics. The theory predicts how outliers evolve with training time, network width, output scale, and initialization variance. The bulk-plus-outlier picture describes simple tasks with few output channels, while tasks with many outputs, such as ImageNet classification and GPT language modeling, are better described by a restructuring of the spectral bulk; the spectral edge still converges in sufficiently wide networks.
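The contrast between the two parameterizations can be made concrete with a small numerical sketch. The code below is an illustration under textbook conventions, not the paper's setup: it assumes a two-layer linear network with a single training example, output scale 1/N with learning rate η₀N for mean-field/μP, and output scale 1/√N with learning rate η₀ for NTK. The function name `hidden_update_per_unit_error` and all default values are hypothetical. It measures how much the hidden pre-activations move after one gradient step, per unit of output error.

```python
import numpy as np

def hidden_update_per_unit_error(width, parameterization, dim=64, eta0=0.5, seed=0):
    """RMS change of hidden pre-activations after one GD step, divided by |f - y|.

    Assumed model (illustrative, not taken from the paper):
        h = W x / sqrt(dim),   f = out_scale * a . h,   loss = 0.5 * (f - y)^2
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)           # single input, shared across widths for a fixed seed
    y = 1.0                                # scalar target
    W = rng.standard_normal((width, dim))  # hidden weights, O(1) entries
    a = rng.standard_normal(width)         # readout weights, O(1) entries

    if parameterization == "mean_field":   # mean-field / muP convention (assumption)
        out_scale, lr = 1.0 / width, eta0 * width
    elif parameterization == "ntk":        # NTK convention (assumption)
        out_scale, lr = 1.0 / np.sqrt(width), eta0
    else:
        raise ValueError(f"unknown parameterization: {parameterization}")

    h = W @ x / np.sqrt(dim)
    f = out_scale * (a @ h)
    residual = f - y

    # Gradient of the loss with respect to W is a rank-one matrix.
    grad_W = residual * out_scale * np.outer(a, x) / np.sqrt(dim)
    W_new = W - lr * grad_W
    h_new = W_new @ x / np.sqrt(dim)

    rms_change = np.sqrt(np.mean((h_new - h) ** 2))
    return rms_change / abs(residual)

if __name__ == "__main__":
    for name in ("mean_field", "ntk"):
        for width in (256, 4096):
            print(name, width, hidden_update_per_unit_error(width, name))
```

Under these assumptions the mean-field/μP value stays roughly constant as the width grows, while the NTK value shrinks like 1/√N, which is the simplest one-step version of width-consistent versus width-dependent feature updates. The rank-one weight update computed here is also the kind of perturbation that can produce a spectral outlier on top of the random bulk.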

📝 Abstract

We study the evolution of hidden-weight spectra in wide neural networks trained by (stochastic) gradient descent. We develop a two-level dynamical mean-field theory (DMFT) that jointly tracks bulk and outlier spectral dynamics for spiked ensembles whose spike directions remain statistically dependent on the random bulk. We apply this framework to two settings: (1) infinite-width nonlinear networks in mean-field/$μ$P scaling and (2) deep linear networks in the proportional high-dimensional limit, where width, input dimension, and sample size diverge with fixed ratios. Our theory predicts how outliers evolve with training time, width, output scale, and initialization variance. In deep linear networks, $μ$P yields width-consistent outlier dynamics and hyperparameter transfer, including width-stable growth of the leading NTK mode toward the edge of stability (EoS). In contrast, NTK parameterization exhibits strongly width-dependent outlier dynamics, despite converging to a stable large-width limit. We show that this bulk+outlier picture is descriptive of simple tasks with few output channels, but that tasks involving large numbers of outputs (ImageNet classification or GPT language modeling) are better described by a restructuring of the spectral bulk. We develop a toy model with extensive output channels that recapitulates this phenomenon and show that the edge of the spectrum still converges for sufficiently wide networks.
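For readers unfamiliar with the bulk+outlier picture, a minimal static sketch may help. It uses the classical spiked random matrix model, in which a rank-one spike is added to a square Gaussian matrix. This is a deliberate simplification: the spike here is drawn independently of the bulk, whereas the abstract concerns spikes that remain statistically dependent on the bulk and evolve during training. The function names `spiked_singular_values` and `predicted_outlier` are hypothetical.

```python
import numpy as np

def spiked_singular_values(n, theta, seed=0):
    """Singular values (descending) of G / sqrt(n) + theta * u v^T.

    G is an n x n standard Gaussian matrix; u and v are random unit vectors
    drawn independently of G (a simplifying assumption, see the lead-in).
    """
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n, n)) / np.sqrt(n)
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    W = G + theta * np.outer(u, v)
    return np.linalg.svd(W, compute_uv=False)

def predicted_outlier(theta):
    """Large-n prediction for the top singular value of the square spiked model.

    For theta > 1 an outlier separates from the bulk at theta + 1/theta;
    otherwise the top singular value sticks to the bulk edge at 2.
    """
    return theta + 1.0 / theta if theta > 1.0 else 2.0

if __name__ == "__main__":
    for theta in (0.5, 3.0):
        s = spiked_singular_values(600, theta)
        print(f"theta={theta}: top={s[0]:.3f}, second={s[1]:.3f}, "
              f"predicted top={predicted_outlier(theta):.3f}")
```

In this independent-spike setting, a spike of strength above 1 produces a single singular value outside the bulk while the remaining singular values stay near the bulk edge at 2; a weaker spike is absorbed into the bulk.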

Problem

Research questions and friction points this paper is trying to address.

spectral dynamics

outlier evolution

neural networks

gradient descent

high-dimensional limit

Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamical mean-field theory

spectral outliers

μP parameterization