Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates the information-theoretic limits of learning a one-hidden-layer teacher network with hierarchical features from noisy queries, with a view to transferring knowledge to a smaller student model, in the high-dimensional regime where the teacher width scales linearly with the input dimension. Using a heuristic leave-one-out decoupling argument, validated numerically, the authors derive closed fixed-point equations that reveal a sequence of sharp phase transitions in feature learning: as the sample size n increases, teacher features become recoverable one after another, each through a discontinuous jump in overlap. The authors introduce the notion of “effective width” k_c, the number of features learnable at a given data budget, which unifies two distinct scaling regimes under the single relation Θ(k_c d/n) for the Bayes-optimal generalization error. Experiments show that a student trained with Adam near this effective width attains the optimal scaling laws up to a small algorithmic gap.

📝 Abstract

We study the information-theoretic limits of learning a one-hidden-layer teacher network with hierarchical features from noisy queries, in the context of knowledge transfer to a smaller student model. We work in the high-dimensional regime where the teacher width $k$ scales linearly with the input dimension $d$ -- a setting that captures large-but-finite-width networks and has only recently become analytically tractable. Using a heuristic leave-one-out decoupling argument, validated numerically throughout, we derive asymptotically sharp characterizations of the Bayes-optimal generalization error and individual feature overlaps via a system of closed fixed-point equations. These equations reveal that feature learnability is governed by a sequence of sharp phase transitions: as data grows, teacher features become recoverable sequentially, each through a discontinuous jump in overlap. This sequential acquisition underlies a precise notion of \textit{effective width} $k_c$ -- the number of learnable features at a given data budget $n$ -- which unifies two distinct scaling regimes: a feature-learning regime in which the Bayes-optimal generalization error $\varepsilon^{\rm BO}$ scales as $n^{1/(2\beta)-1}$, and a refinement regime in which it scales as $n^{-1}$, where $\beta>1/2$ is the exponent of the power-law feature hierarchy. Both laws collapse to the single relation $\varepsilon^{\rm BO}=\Theta(k_c d/n)$. We further show empirically that a student trained with \textsc{Adam} near the effective width $k_c$ achieves these optimal scaling laws (up to a small algorithmic gap), and provide an information-theoretic account of the associated scaling in model size.
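
To make the two scaling regimes in the abstract concrete, the following is a minimal sketch that only does exponent bookkeeping. It takes the stated laws at face value and reads off what the relation $\varepsilon^{\rm BO}=\Theta(k_c d/n)$ implies for the growth of $k_c d$ with $n$. The function names and regime labels are hypothetical, the hidden constants and the location of the crossover between regimes are not given in the abstract and are not modeled, and $d$ is treated as fixed while $n$ varies (an assumption made here for illustration).

```python
def error_exponent(beta: float, regime: str) -> float:
    """Exponent a such that the Bayes-optimal error scales as n**a.

    Taken directly from the abstract:
      feature-learning regime: a = 1/(2*beta) - 1
      refinement regime:       a = -1
    `regime` labels are illustrative names, not the paper's notation.
    """
    if beta <= 0.5:
        raise ValueError("the abstract assumes beta > 1/2")
    if regime == "feature_learning":
        return 1.0 / (2.0 * beta) - 1.0
    if regime == "refinement":
        return -1.0
    raise ValueError(f"unknown regime: {regime!r}")


def implied_kc_d_exponent(beta: float, regime: str) -> float:
    """Exponent b such that k_c * d scales as n**b.

    Follows from error = Theta(k_c * d / n): multiplying the error
    by n gives k_c * d, so b = a + 1 (d held fixed, by assumption).
    """
    return error_exponent(beta, regime) + 1.0


if __name__ == "__main__":
    for beta in (1.0, 2.0):
        for regime in ("feature_learning", "refinement"):
            a = error_exponent(beta, regime)
            b = implied_kc_d_exponent(beta, regime)
            print(f"beta={beta}, {regime}: error ~ n^{a}, k_c*d ~ n^{b}")
```

Under these assumptions the effective width grows like $n^{1/(2\beta)}$ while features are still being acquired and stops growing (exponent 0) in the refinement regime, which is one way to read how a single relation can cover both laws.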

Problem

Research questions and friction points this paper is trying to address.

feature learning

Bayes-optimal

scaling laws

phase transitions

knowledge transfer

Innovation

Methods, ideas, or system contributions that make the work stand out.

sharp phase transitions

Bayes-optimal scaling laws

effective width