🤖 AI Summary
This work proposes a mechanism for the power-law decay of prediction error with increasing sample size in deep networks, attributing it to the layer-wise recovery of latent compositional features through hierarchical representation learning. Focusing on a class of high-dimensional hierarchical target functions whose feature weights decay as a power law, the authors analyze a layer-wise spectral algorithm using random matrix theory and a resolvent-based perturbation argument, and link the sequential recovery of individual features to the global scaling law. They derive sharp recovery thresholds for individual features and an explicit power-law form for the prediction error. The analysis goes beyond standard gap-based perturbation bounds by giving matching upper and lower bounds for individual eigenvector recovery. Numerical experiments confirm that features are recovered sequentially in order of strength, that the thresholds are smoothed at finite sample sizes, and that the method separates from non-hierarchical kernel baselines.
📝 Abstract
We propose a simple mechanism by which scaling laws emerge from feature learning in multi-layer networks. We study a high-dimensional hierarchical target that is a globally high-degree function, but that can be represented by a combination of latent compositional features whose weights decrease as a power law. We show that a layer-wise spectral algorithm adapted to this compositional structure achieves improved scaling relative to shallow, non-adaptive methods, and recovers the latent directions sequentially: strong features become detectable at small sample sizes, while weaker features require more data. We prove sharp feature-wise recovery thresholds and show that aggregating these transitions yields an explicit power-law decay of the prediction error. Technically, the analysis relies on random matrix methods and a resolvent-based perturbation argument, which gives matching upper and lower bounds for individual eigenvector recovery beyond what standard gap-based perturbation bounds provide. Numerical experiments confirm the predicted sequential recovery, finite-size smoothing of the thresholds, and separation from non-hierarchical kernel baselines. Together, these results show how smooth scaling laws can emerge from a cascade of sharp feature-learning transitions.
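The aggregation step described in the abstract, where many sharp per-feature transitions add up to a smooth power law, can be illustrated with a toy calculation. The sketch below is not the paper's model or algorithm: the weight profile `k ** (-alpha)`, the recovery threshold `c * k ** (2 * alpha)` (sample complexity inversely proportional to squared feature strength), and the function names `toy_error` and `loglog_slope` are all assumptions made purely for illustration.

```python
import math


def toy_error(n, alpha=1.0, c=1.0, num_features=100_000):
    """Total squared weight of the features NOT yet recovered at sample size n.

    Assumptions (hypothetical, for illustration only):
      * feature k has weight k ** (-alpha), with alpha > 1/2;
      * feature k is recovered exactly when n >= c * k ** (2 * alpha),
        i.e. a sharp threshold with no finite-size smoothing.
    """
    err = 0.0
    for k in range(1, num_features + 1):
        threshold = c * k ** (2 * alpha)  # assumed sample size needed for feature k
        if n < threshold:                 # feature k still undetected
            err += k ** (-2 * alpha)      # it contributes its squared weight
    return err


def loglog_slope(n_lo, n_hi, **kwargs):
    """Empirical scaling exponent of toy_error between two sample sizes."""
    e_lo = toy_error(n_lo, **kwargs)
    e_hi = toy_error(n_hi, **kwargs)
    return (math.log(e_hi) - math.log(e_lo)) / (math.log(n_hi) - math.log(n_lo))


if __name__ == "__main__":
    for n in (150, 1_500, 15_000):
        print(n, toy_error(n))
    # Under the assumptions above, the tail sum predicts an exponent of
    # -(2 * alpha - 1) / (2 * alpha), which is -0.5 for alpha = 1.
    print("empirical slope:", loglog_slope(150, 15_000))
```

Each feature switches on abruptly in this toy, so the error curve is a staircase. Because the step heights and step locations both follow power laws, the staircase tracks a straight line on a log-log plot, and the measured slope should land close to the tail-sum prediction. The exponents proved in the paper depend on its actual thresholds and need not match this toy.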