On the origin of neural scaling laws: from random graphs to natural language

📅 2026-01-15
📈 Citations: 3
Influential: 0
🤖 AI Summary
This study investigates whether neural scaling laws depend on power-law structure in the data. The authors systematically simplify the language modeling task by training a minimal Transformer (only two layers, with a context length of 50) to predict random-walk sequences on synthetic graphs of tunable complexity, including Erdős–Rényi and Barabási–Albert models. They show that neural scaling laws emerge even in data without power-law correlations, establishing that such structure is not necessary for scaling laws to arise, and they find a monotonic relationship between linguistic complexity and the scaling exponent when training on increasingly simplified generative language models. The work also introduces an alternative method for constructing compute-optimal curves and provides preliminary evidence that the Maximal Update Parametrization (μP) is more parameter-efficient than standard parametrization.
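The scaling exponents mentioned in the summary are typically extracted by fitting a power law L(N) ≈ a·N^(−α) to measured losses. A minimal sketch of such a fit in log-log space; the function name, synthetic data, and the optional irreducible-loss term are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def fit_scaling_exponent(sizes, losses, irreducible=0.0):
    """Fit L(N) ~ a * N**(-alpha) + L_inf by linear regression in log-log space.

    `irreducible` is an assumed asymptotic loss L_inf subtracted before
    taking logs (0.0 if the loss is believed to decay to zero).
    """
    x = np.log(np.asarray(sizes, dtype=float))
    y = np.log(np.asarray(losses, dtype=float) - irreducible)
    slope, intercept = np.polyfit(x, y, 1)
    return -slope, np.exp(intercept)  # (exponent alpha, prefactor a)

# Synthetic check: losses generated with alpha = 0.3, a = 5.0.
sizes = np.array([1e5, 1e6, 1e7, 1e8])
losses = 5.0 * sizes ** -0.3
alpha, a = fit_scaling_exponent(sizes, losses)
```

On real training curves the fit is more delicate (the choice of L_inf and the fitting window both shift the exponent), which is the kind of issue the paper's critical analysis of prior fits addresses.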

📝 Abstract
Scaling laws have played a major role in the modern AI revolution, providing practitioners with predictive power over how model performance will improve with increasing data, compute, and number of model parameters. This has spurred intense interest in the origin of neural scaling laws, with a common suggestion being that they arise from power law structure already present in the data. In this paper we study scaling laws for transformers trained to predict random walks (bigrams) on graphs with tunable complexity. We demonstrate that this simplified setting already gives rise to neural scaling laws, even in the absence of power law structure in the data correlations. We further dial down the complexity of natural language systematically by training on sequences sampled from increasingly simplified generative language models, from 4-, 2-, and 1-layer transformer language models down to language bigrams, revealing a monotonic evolution of the scaling exponents. Our results also include scaling laws obtained from training on random walks on random graphs drawn from Erdős–Rényi and scale-free Barabási–Albert ensembles. Finally, we revisit conventional scaling laws for language modeling: we demonstrate that several essential results can be reproduced using 2-layer transformers with a context length of 50, provide a critical analysis of various fits used in prior literature, demonstrate an alternative method for obtaining compute-optimal curves compared with current practice in the published literature, and provide preliminary evidence that maximal update parameterization may be more parameter efficient than standard parameterization.
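The random-walk training data described in the abstract can be sketched as follows. This is a hedged illustration, not the authors' code: the G(n, p) generator and the token-sampling helper are simplifying assumptions (a Barabási–Albert ensemble would substitute a preferential-attachment generator):

```python
import random

def erdos_renyi(n, p, rng):
    """Adjacency lists of a G(n, p) random graph (undirected, no self-loops)."""
    adj = {v: [] for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj

def random_walk_tokens(adj, length, rng):
    """Sample a uniform random walk; each visited node is emitted as a token id."""
    # Start from a node that has at least one neighbor so the walk never stalls.
    node = rng.choice([v for v, nbrs in adj.items() if nbrs])
    walk = [node]
    for _ in range(length - 1):
        node = rng.choice(adj[node])
        walk.append(node)
    return walk

rng = random.Random(0)
adj = erdos_renyi(100, 0.1, rng)
seq = random_walk_tokens(adj, 50, rng)  # context length 50, as in the paper
```

Sequences like `seq` would serve as next-token-prediction training data, with node ids as the vocabulary; graph size and edge density are the "tunable complexity" knobs.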
Problem

Research questions and friction points this paper is trying to address.

neural scaling laws
origin
power law structure
transformers
random graphs
Innovation

Methods, ideas, or system contributions that make the work stand out.

neural scaling laws
random graphs
simplified language models
maximal update parameterization
compute-optimal scaling