AI Summary
Existing theoretical frameworks struggle to quantitatively predict neural scaling law exponents from the statistical properties of natural language, particularly in data-constrained regimes. This work proposes the first ab initio theory of these exponents: grounded in information theory and statistical language modeling, it requires no free parameters. By analyzing the decay of token-pair correlations with temporal separation and the decay of next-token conditional entropy with context length, the theory derives a closed-form expression for the scaling exponents. Relying solely on intrinsic statistical characteristics of natural language, it accurately predicts scaling exponents in data-limited conditions. Experimental validation demonstrates close agreement between theoretical predictions and empirical measurements from GPT-2 and LLaMA style models trained from scratch on the TinyStories and WikiText datasets.
Abstract
Although experimental neural scaling laws have substantially guided empirical progress in large-scale machine learning, no existing theory can quantitatively predict the exponents of these important laws for any modern LLM trained on any natural language dataset. We provide the first such theory in the case of data-limited scaling laws. We isolate two key statistical properties of language that alone can predict neural scaling exponents: (i) the decay of pairwise token correlations with time separation between token pairs, and (ii) the decay of the next-token conditional entropy with the length of the conditioning context. We further derive a simple formula in terms of these statistics that predicts data-limited neural scaling exponents from first principles without any free parameters or synthetic data models. Our theory exhibits a remarkable match with experimentally measured neural scaling laws obtained from training GPT-2 and LLaMA style models from scratch on two qualitatively different benchmarks, TinyStories and WikiText.
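The two statistics the abstract isolates are, in principle, directly measurable from a token corpus. As a minimal illustrative sketch (not the authors' code, and not their prediction formula), the snippet below estimates (i) the mutual information between token pairs at a given lag, as a proxy for pairwise correlation decay, and (ii) the plug-in conditional entropy of the next token given a fixed-length context, then extracts decay exponents via a log-log fit. The function names, the whitespace tokenizer, the `corpus.txt` path, and the choice of a power-law fit are assumptions made for illustration only.

```python
# Minimal sketch (not the paper's method): count-based plug-in estimates of
# (i) pairwise token mutual information vs. lag and
# (ii) next-token conditional entropy vs. context length,
# followed by an illustrative log-log fit of the decay exponents.

import math
from collections import Counter

import numpy as np


def pairwise_mutual_information(tokens, lag):
    """Plug-in estimate of I(x_i ; x_{i+lag}) in nats."""
    pairs = Counter(zip(tokens, tokens[lag:]))
    left = Counter(tokens[:-lag])
    right = Counter(tokens[lag:])
    n = len(tokens) - lag
    mi = 0.0
    for (a, b), c in pairs.items():
        p_ab = c / n
        # p_ab / (p_a * p_b) simplifies to c * n / (left[a] * right[b])
        mi += p_ab * math.log(c * n / (left[a] * right[b]))
    return mi


def conditional_entropy(tokens, context_len):
    """Plug-in estimate of H(x_i | x_{i-k}, ..., x_{i-1}) in nats."""
    ctx_counts = Counter()
    joint_counts = Counter()
    for i in range(context_len, len(tokens)):
        ctx = tuple(tokens[i - context_len:i])
        ctx_counts[ctx] += 1
        joint_counts[ctx + (tokens[i],)] += 1
    n = len(tokens) - context_len
    h = 0.0
    for full, c in joint_counts.items():
        p_joint = c / n
        p_cond = c / ctx_counts[full[:-1]]
        h -= p_joint * math.log(p_cond)
    return h


def fit_power_law_exponent(xs, ys):
    """Slope of a log-log least-squares fit, assuming y ~ x^(-alpha)."""
    slope, _ = np.polyfit(np.log(xs), np.log(ys), 1)
    return -slope


if __name__ == "__main__":
    # Hypothetical corpus file; a toy whitespace tokenizer stands in for a
    # real subword tokenizer.
    tokens = open("corpus.txt").read().split()
    lags = [1, 2, 4, 8, 16, 32]
    mi_decay = [pairwise_mutual_information(tokens, t) for t in lags]
    ctx_lens = [1, 2, 3, 4]
    ent_decay = [conditional_entropy(tokens, k) for k in ctx_lens]
    print("correlation-decay exponent:", fit_power_law_exponent(lags, mi_decay))
    print("entropy-decay exponent:", fit_power_law_exponent(ctx_lens, ent_decay))
```

Note that plug-in estimators of this kind are biased for small corpora and long contexts, and the fitted exponents here are only meant to show what "decay of correlations" and "decay of conditional entropy" refer to; the paper's own formula for converting such statistics into scaling exponents is not reproduced here.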