🤖 AI Summary
This work addresses the limited universality of existing neural scaling laws, which struggle to characterize coherently the joint effects of model capacity, dataset size, and compute on performance. The authors abstract learning tasks as the progressive coverage of patterns drawn from a long-tailed (Zipfian) distribution and introduce the notion of an “effective frontier” that marks the boundary between acquired and unacquired knowledge. From the constraint each resource places on this frontier, they derive a unified scaling law framework with three bottlenecks: capacity, coverage, and optimization. By incorporating the Max-Bottleneck principle, they show, for the first time according to the authors, that the Kaplan and Chinchilla scaling laws emerge as equilibrium solutions of the same optimization problem under different active bottlenecks. The theoretical predictions align closely with empirical observations and offer broadly applicable guidance for training large models.
📝 Abstract
Neural scaling laws describe the predictable power-law improvement of test loss with respect to model capacity ($N$), dataset size ($D$), and compute ($C$). However, existing theoretical explanations often rely on specific architectures or complex kernel methods, lacking intuitive universality. In this paper, we propose a unified framework that abstracts general learning tasks as the progressive coverage of patterns from a long-tailed (Zipfian) distribution. We introduce the Effective Frontier ($k_\star$), a threshold in the pattern rank space that separates learned knowledge from the unlearned tail. We prove that reducible loss is asymptotically determined by the probability mass of the tail beyond a resource-dependent frontier truncation. Based on our framework, we derive the precise scaling laws for $N$, $D$, and $C$, attributing them to capacity, coverage, and optimization bottlenecks, respectively. Furthermore, we unify these mechanisms via a Max-Bottleneck principle, demonstrating that the Kaplan and Chinchilla scaling laws are not contradictory, but equilibrium solutions to the same constrained optimization problem under different active bottlenecks.
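The link between the Effective Frontier and power-law loss can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes pattern probabilities $p_k \propto k^{-\alpha}$ with $\alpha > 1$, truncates the distribution to a finite number of ranks, and uses hypothetical names (`zipf_tail_mass`, `fitted_exponent`, `alpha`, `n_patterns`). Under those assumptions, the mass of the unlearned tail beyond $k_\star$ should fall off roughly as $k_\star^{1-\alpha}$, so any resource that pushes the frontier outward as a power of $N$, $D$, or $C$ would yield a power law in that resource.

```python
import math


def zipf_tail_mass(k_star, alpha, n_patterns=200_000):
    """Probability mass of ranks above k_star when p_k is proportional to k**(-alpha).

    Assumption: the distribution is truncated to n_patterns ranks so the
    sums are finite; the paper's exact parameterization is not reproduced here.
    """
    total = math.fsum(k ** -alpha for k in range(1, n_patterns + 1))
    tail = math.fsum(k ** -alpha for k in range(k_star + 1, n_patterns + 1))
    return tail / total


def fitted_exponent(k1, k2, alpha, n_patterns=200_000):
    """Log-log slope of tail mass between two frontier positions k1 < k2."""
    m1 = zipf_tail_mass(k1, alpha, n_patterns)
    m2 = zipf_tail_mass(k2, alpha, n_patterns)
    return math.log(m2 / m1) / math.log(k2 / k1)


if __name__ == "__main__":
    alpha = 2.0  # hypothetical Zipf exponent, chosen only for illustration
    slope = fitted_exponent(100, 1000, alpha)
    print(f"measured slope: {slope:.3f}, heuristic 1 - alpha: {1 - alpha:.3f}")
```

The measured slope should sit close to $1 - \alpha$ as long as both frontier positions are far below the truncation rank; nearer the truncation, finite-size effects steepen the curve.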