🤖 AI Summary
This work investigates the information-theoretic limits of learning a one-hidden-layer teacher network with hierarchical features from noisy queries, with a view to transferring knowledge to a smaller student model in the high-dimensional regime. Using a heuristic leave-one-out decoupling argument from statistical physics, which reduces the problem to a closed system of fixed-point equations, the study reveals a sequence of sharp phase transitions in feature learning: as the sample size n increases, teacher features become learnable one after another, each through a discontinuous jump in overlap. The authors introduce the notion of “effective width” k_c, the number of features learnable at a given sample size, which unifies two distinct scaling regimes into a single relation for the Bayes-optimal generalization error, ε^BO = Θ(k_c d/n). Experiments show that student models trained with Adam near this effective width come close to the theoretical limit, up to a small algorithmic gap.
📝 Abstract
We study the information-theoretic limits of learning a one-hidden-layer teacher network with hierarchical features from noisy queries, in the context of knowledge transfer to a smaller student model. We work in the high-dimensional regime where the teacher width $k$ scales linearly with the input dimension $d$ -- a setting that captures large-but-finite-width networks and has only recently become analytically tractable. Using a heuristic leave-one-out decoupling argument, validated numerically throughout, we derive asymptotically sharp characterizations of the Bayes-optimal generalization error and individual feature overlaps via a system of closed fixed-point equations. These equations reveal that feature learnability is governed by a sequence of sharp phase transitions: as data grows, teacher features become recoverable sequentially, each through a discontinuous jump in overlap. This sequential acquisition underlies a precise notion of \textit{effective width} $k_c$ -- the number of learnable features at a given data budget $n$ -- which unifies two distinct scaling regimes: a feature-learning regime in which the Bayes-optimal generalization error $\varepsilon^{\rm BO}$ scales as $n^{1/(2\beta)-1}$, and a refinement regime in which it scales as $n^{-1}$, where $\beta>1/2$ is the exponent of the power-law feature hierarchy. Both laws collapse to the single relation $\varepsilon^{\rm BO}=\Theta(k_c d/n)$. We further show empirically that a student trained with \textsc{Adam} near the effective width $k_c$ achieves these optimal scaling laws (up to a small algorithmic gap), and provide an information-theoretic account of the associated scaling in model size.
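As a rough illustration of how the two scaling laws collapse onto $\varepsilon^{\rm BO}=\Theta(k_c d/n)$, the sketch below computes the exponent of $n$ in each regime and the growth of $k_c$ that the unified relation would then imply. It rests on assumptions not stated in the abstract: $d$ is held fixed, all constants are ignored, and the function names and regime labels are hypothetical. It is bookkeeping on the exponents quoted above, not the paper's fixed-point analysis.

```python
def bo_error_exponent(beta: float, regime: str) -> float:
    """Exponent a in eps_BO ~ n**a, using the two laws quoted in the abstract."""
    if beta <= 0.5:
        raise ValueError("the abstract assumes beta > 1/2")
    if regime == "feature_learning":
        # eps_BO ~ n^{1/(2*beta) - 1}
        return 1.0 / (2.0 * beta) - 1.0
    if regime == "refinement":
        # eps_BO ~ n^{-1}
        return -1.0
    raise ValueError(f"unknown regime: {regime!r}")


def effective_width_exponent(beta: float, regime: str) -> float:
    """Exponent b in k_c ~ n**b implied by eps_BO = Theta(k_c * d / n).

    Assumption (ours): d is fixed, so k_c ~ eps_BO * n and b = a + 1.
    """
    return bo_error_exponent(beta, regime) + 1.0


if __name__ == "__main__":
    beta = 1.0  # illustrative value only; any beta > 1/2 is allowed
    for regime in ("feature_learning", "refinement"):
        a = bo_error_exponent(beta, regime)
        b = effective_width_exponent(beta, regime)
        print(f"{regime}: eps_BO ~ n^{a:+.3f}, k_c ~ n^{b:+.3f}")
```

Under these assumptions, $k_c$ grows as $n^{1/(2\beta)}$ while features are still being acquired and stops growing with $n$ in the refinement regime, which is consistent with reading $k_c$ as the number of features learnable at data budget $n$.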