On the origin of neural scaling laws: from random graphs to natural language

📅 2026-01-15
📈 Citations: 3
Influential: 0
🤖 AI Summary
This study investigates whether neural scaling laws depend on power-law structure in the data. The authors systematically simplify the language modeling task by training a minimal Transformer (only two layers, with a context length of 50) to predict random-walk sequences on synthetic graphs of tunable complexity, including Erdős–Rényi and Barabási–Albert models. They show that neural scaling laws emerge even in data without power-law correlations, establishing that such structure is not necessary for scaling laws to arise, and they find a monotonic relationship between linguistic complexity and the scaling exponent when training on increasingly simplified generative language models. The work also introduces an alternative method for constructing compute-optimal curves and provides preliminary evidence that the Maximal Update Parametrization (μP) is more parameter-efficient than standard parametrization.
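The scaling exponents mentioned in the summary are typically extracted by fitting a power law L(N) ≈ a·N^(−α) to measured losses. A minimal sketch of such a fit in log-log space; the function name, synthetic data, and the optional irreducible-loss term are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def fit_scaling_exponent(sizes, losses, irreducible=0.0):
    """Fit L(N) ~ a * N**(-alpha) + L_inf by linear regression in log-log space.

    `irreducible` is an assumed asymptotic loss L_inf subtracted before
    taking logs (0.0 if the loss is believed to decay to zero).
    """
    x = np.log(np.asarray(sizes, dtype=float))
    y = np.log(np.asarray(losses, dtype=float) - irreducible)
    slope, intercept = np.polyfit(x, y, 1)
    return -slope, np.exp(intercept)  # (exponent alpha, prefactor a)

# Synthetic check: losses generated with alpha = 0.3, a = 5.0.
sizes = np.array([1e5, 1e6, 1e7, 1e8])
losses = 5.0 * sizes ** -0.3
alpha, a = fit_scaling_exponent(sizes, losses)
```

On real training curves the fit is more delicate (the choice of L_inf and the fitting window both shift the exponent), which is the kind of issue the paper's critical analysis of prior fits addresses.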

📝 Abstract
Scaling laws have played a major role in the modern AI revolution, providing practitioners with predictive power over how model performance will improve with increasing data, compute, and number of model parameters. This has spurred intense interest in the origin of neural scaling laws, with a common suggestion being that they arise from power law structure already present in the data. In this paper we study scaling laws for transformers trained to predict random walks (bigrams) on graphs with tunable complexity. We demonstrate that this simplified setting already gives rise to neural scaling laws, even in the absence of power law structure in the data correlations. We further dial down the complexity of natural language systematically by training on sequences sampled from increasingly simplified generative language models, from 4-, 2-, and 1-layer transformer language models down to language bigrams, revealing a monotonic evolution of the scaling exponents. Our results also include scaling laws obtained from training on random walks on random graphs drawn from Erdős–Rényi and scale-free Barabási–Albert ensembles. Finally, we revisit conventional scaling laws for language modeling: we demonstrate that several essential results can be reproduced using 2-layer transformers with a context length of 50, provide a critical analysis of various fits used in prior literature, demonstrate an alternative method for obtaining compute-optimal curves compared with current practice in the published literature, and provide preliminary evidence that maximal update parameterization may be more parameter efficient than standard parameterization.
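The random-walk training data described in the abstract can be sketched as follows. This is a hedged illustration, not the authors' code: the G(n, p) generator and the token-sampling helper are simplifying assumptions (a Barabási–Albert ensemble would substitute a preferential-attachment generator):

```python
import random

def erdos_renyi(n, p, rng):
    """Adjacency lists of a G(n, p) random graph (undirected, no self-loops)."""
    adj = {v: [] for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj

def random_walk_tokens(adj, length, rng):
    """Sample a uniform random walk; each visited node is emitted as a token id."""
    # Start from a node that has at least one neighbor so the walk never stalls.
    node = rng.choice([v for v, nbrs in adj.items() if nbrs])
    walk = [node]
    for _ in range(length - 1):
        node = rng.choice(adj[node])
        walk.append(node)
    return walk

rng = random.Random(0)
adj = erdos_renyi(100, 0.1, rng)
seq = random_walk_tokens(adj, 50, rng)  # context length 50, as in the paper
```

Sequences like `seq` would serve as next-token-prediction training data, with node ids as the vocabulary; graph size and edge density are the "tunable complexity" knobs.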
Problem

Research questions and friction points this paper is trying to address.

neural scaling laws
origin
power law structure
transformers
random graphs
Innovation

Methods, ideas, or system contributions that make the work stand out.

neural scaling laws
random graphs
simplified language models
maximal update parameterization
compute-optimal scaling