Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates the feature learning dynamics of the first two steps of gradient descent in linear-width two-layer neural networks. A single gradient step is limited to capturing one direction and requires the target function's information exponent to be one. Using random matrix theory and high-dimensional asymptotic analysis, this study provides the first complete characterization of the spectral structure of the weight matrix after the second update. The results show that appropriately scaled learning rates can induce multiple outlier eigen-directions, with their number governed by the step-size scaling exponents $\alpha_1$ and $\alpha_2$. Moreover, reusing training batches lets the second update capture directions whose information exponent exceeds one, provided the exponents are chosen properly, which enables richer, multi-directional feature learning in the high-dimensional limit.
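The claim that a single gradient step is approximately rank-one can be checked numerically. The following is a minimal sketch, not the paper's setup: the tanh activation, the linear teacher (information exponent one), the squared loss, the random-sign second layer, and the sizes `d`, `N`, `n` are all assumptions chosen for illustration.

```python
import numpy as np


def one_step_gradient(d=300, N=300, n=600, seed=0):
    """First-layer gradient of a two-layer net at random initialization.

    Assumed (hypothetical) setup: f(x) = a^T tanh(W x), squared loss,
    Gaussian inputs, and a linear teacher y = <beta, x>.
    """
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)                       # unit-norm target direction
    X = rng.standard_normal((n, d))                    # n samples in dimension d
    y = X @ beta                                       # linear teacher labels
    W = rng.standard_normal((N, d)) / np.sqrt(d)       # first-layer weights
    a = rng.choice([-1.0, 1.0], size=N) / np.sqrt(N)   # fixed second layer

    pre = X @ W.T                                      # pre-activations, shape (n, N)
    resid = y - np.tanh(pre) @ a                       # residuals y - f(x), shape (n,)
    act_grad = 1.0 - np.tanh(pre) ** 2                 # tanh'(pre), shape (n, N)

    # Gradient of (1/2n) * sum_j (y_j - f(x_j))^2 with respect to W, shape (N, d).
    G = -((act_grad * resid[:, None]) * a[None, :]).T @ X / n
    return G, beta


def top_two_singular_values(G):
    """Return the two largest singular values of G."""
    s = np.linalg.svd(G, compute_uv=False)             # sorted in descending order
    return s[0], s[1]
```

Calling `top_two_singular_values(one_step_gradient()[0])` should show a leading singular value well separated from the second one, which is the sense in which the one-step update carries a single dominant direction in this toy setting.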

📝 Abstract

We study feature learning in two-layer neural networks within the linear-width regime, where the number of hidden neurons, sample size, and input dimension scale proportionally. While recent work has analyzed feature learning via a single step of gradient descent, such updates are fundamentally limited: they are approximately rank-one, capturing only a single direction, and require the target function to have an information exponent of one. In this paper, we go beyond one-step updates to provide a full characterization of the features learned during the second step of gradient descent with step-sizes $\eta_1 \asymp N^{\alpha_1}$ and $\eta_2 \asymp N^{\alpha_2}$ for $\alpha_1, \alpha_2 \in [0, 0.5)$. We derive a sharp spectral characterization of the updated weights, demonstrating they behave as a spiked random matrix with multiple outliers, each corresponding to a learned direction. We show that the number of these outliers is determined by the scaling parameters $\alpha_1$ and $\alpha_2$ through $\lfloor \frac{\alpha_2}{1/2 - \alpha_1} \rfloor$. Furthermore, by analyzing the alignment between these learned directions and the target function, we identify a qualitative gap between training with independent versus reused batches. While independent batches restrict learning to directions with an information exponent of one, batch reuse enables the second update to capture directions even when the information exponent exceeds one, under the condition that $\alpha_1, \alpha_2$ are chosen properly. This confirms that the benefits of batch reuse, previously observed in finite-width regimes, persist in the high-dimensional linear-width limit. By characterizing these early-phase spectral transitions, our work establishes a tractable mathematical framework for studying optimization and feature learning phenomenology in modern overparameterized networks.
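The outlier-count expression in the abstract is easy to evaluate for concrete exponents. The helper below is an illustrative sketch: the function name is hypothetical, and it only evaluates the stated floor expression on the stated range, with exact rational arithmetic so that boundary cases are not disturbed by floating-point rounding.

```python
import math
from fractions import Fraction


def predicted_outlier_count(alpha1, alpha2):
    """Evaluate floor(alpha2 / (1/2 - alpha1)) for alpha1, alpha2 in [0, 1/2).

    Pass the exponents as strings (e.g. "0.3") or Fractions to keep the
    arithmetic exact; this helper is hypothetical and not from the paper.
    """
    a1, a2 = Fraction(alpha1), Fraction(alpha2)
    half = Fraction(1, 2)
    if not (0 <= a1 < half and 0 <= a2 < half):
        raise ValueError("alpha1 and alpha2 must lie in [0, 1/2)")
    return math.floor(a2 / (half - a1))


print(predicted_outlier_count("0.25", "0.25"))  # (1/4) / (1/4) = 1   -> 1
print(predicted_outlier_count("0.3", "0.4"))    # (2/5) / (1/5) = 2   -> 2
print(predicted_outlier_count("0.4", "0.45"))   # (9/20) / (1/10) = 4.5 -> 4
```

For a fixed $\alpha_2$, the value grows as $\alpha_1$ approaches $1/2$, because the denominator $1/2 - \alpha_1$ shrinks.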

Problem

Research questions and friction points this paper is trying to address.

feature learning

two-layer neural networks

gradient descent

linear-width regime

information exponent

Innovation

Methods, ideas, or system contributions that make the work stand out.

feature learning

gradient descent

spiked random matrix