🤖 AI Summary
This work addresses a gap in existing scaling laws for large language models: as monotonic power laws, they cannot account for non-monotonic performance degradation caused by overtraining or quantization. By modeling training as information transmission over a noisy channel, the authors propose the Shannon Scaling Law, grounded in the Shannon–Hartley theorem, which gives an information-theoretic characterization of how model capacity depends on signal-to-noise ratio and offers a unified explanation of the U-shaped loss curves observed in practice. The approach combines channel capacity theory, Gaussian noise modeling, and quantization perturbation analysis, and is validated on the Pythia and OLMo2 model families. Experiments show that it outperforms classical scaling laws across math, QA, and code tasks. Fitted only on Pythia models of up to 6.9B parameters and up to 180B tokens, it predicts the loss of the unseen 12B-parameter model up to 307B tokens with a pooled R² of 0.847, and it captures loss basins that conventional methods miss.
📝 Abstract
Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute.
We propose the Shannon Scaling Law, a unified theoretical framework that models LLM training as information transmission over a noisy channel, grounded in the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, our formulation explicitly captures the interaction between learning signal and intrinsic noise. This perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation.
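As a rough illustration of the channel analogy, the sketch below computes the standard Shannon–Hartley capacity $C = B \log_2(1 + S/N)$ and shows how a U-shaped loss can arise once noise grows faster than signal. Only the capacity formula is standard; `toy_noise`, `toy_loss`, and all constants are hypothetical choices made for this example and are not the functional form fitted in the paper.

```python
import math


def shannon_hartley_capacity(bandwidth: float, signal_power: float,
                             noise_power: float) -> float:
    """Shannon-Hartley capacity: C = B * log2(1 + S / N)."""
    if bandwidth < 0 or signal_power < 0 or noise_power <= 0:
        raise ValueError("need bandwidth >= 0, signal_power >= 0, noise_power > 0")
    return bandwidth * math.log2(1.0 + signal_power / noise_power)


def toy_noise(signal_power: float, noise_floor: float = 1.0,
              growth: float = 1e-4) -> float:
    """HYPOTHETICAL noise model: noise grows quadratically with signal.

    With this choice, SNR = S / (noise_floor + growth * S**2) peaks at
    S = sqrt(noise_floor / growth) and falls afterwards.
    """
    return noise_floor + growth * signal_power ** 2


def toy_loss(bandwidth: float, signal_power: float,
             irreducible: float = 1.0, scale: float = 10.0) -> float:
    """HYPOTHETICAL loss that decreases as capacity increases."""
    capacity = shannon_hartley_capacity(
        bandwidth, signal_power, toy_noise(signal_power)
    )
    return irreducible + scale / capacity


if __name__ == "__main__":
    # In the paper's analogy, bandwidth stands in for model parameters and
    # signal power for training tokens; the units here are arbitrary.
    for signal in (10.0, 30.0, 100.0, 300.0, 1000.0):
        print(f"signal={signal:7.1f}  loss={toy_loss(1.0, signal):.3f}")
```

Under these toy assumptions the loss falls until the signal reaches 100 and rises afterwards, which is the qualitative shape described above. A noise term that stays constant would instead give a monotonically decreasing loss.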
We validate our theory through experiments on Pythia and OLMo2 under perturbations, including Gaussian noise, quantization, and supervised fine-tuning on math, QA, and code tasks. The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong $R^2$ scores and accurately capturing loss basins missed by prior approaches. It also extrapolates: fitted on Pythia models of at most 6.9B parameters and at most 180B tokens, it predicts the loss of the unseen 12B model up to 307B tokens with a pooled $R^2 = 0.847$, while monotonic baselines collapse.
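For readers unfamiliar with the metric, the following is a minimal sketch of the coefficient of determination, $R^2 = 1 - SS_{\text{res}} / SS_{\text{tot}}$, computed over one pooled list of points. It assumes "pooled" means that all held-out (model, token-count) points are scored together, which the abstract does not spell out. The loss values are made up for illustration and are not results from the paper.

```python
from typing import Sequence


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # HYPOTHETICAL losses along a training run that degrades late (U-shape).
    observed = [3.0, 2.5, 2.4, 2.6, 3.0]
    # A monotonic prediction keeps decreasing and misses the upturn.
    monotonic_pred = [3.0, 2.5, 2.2, 2.0, 1.9]
    # A U-shaped prediction follows the upturn.
    u_shaped_pred = [3.0, 2.5, 2.45, 2.6, 2.9]

    print("monotonic:", r_squared(observed, monotonic_pred))
    print("u-shaped: ", r_squared(observed, u_shaped_pred))
```

Because $R^2$ is not bounded below by zero, a prediction that keeps decreasing while the observed loss turns upward can score well below zero, which is one way a monotonic baseline can "collapse" on such data.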