Compute-Optimal LLMs Provably Generalize Better With Scale

📅 2025-04-21
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Why do larger language models generalize better? This paper investigates the mechanisms underlying improved generalization with scale, grounded in the Chinchilla compute-optimal scaling framework. We propose a fully empirical Freedman-type martingale concentration inequality whose resulting bound decomposes the generalization gap into three interpretable components: the parameter-to-token ratio, the loss variance, and the quantization error, yielding a predictive scaling law for the generalization gap. Theoretically, we show that along the compute-optimal scaling path both the loss variance and the quantization error decrease with model size, leading to a systematic reduction in the generalization gap. These predictions align closely with large-scale empirical evaluations across diverse model families and training regimes, producing a quantitative, interpretable framework for understanding generalization in large language models.

๐Ÿ“ Abstract
Why do larger language models generalize better? To investigate this question, we develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime, as described by the Chinchilla scaling laws. We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function. This generalization bound can be decomposed into three interpretable components: the number of parameters per token, the loss variance, and the quantization error at a fixed bitrate. As compute-optimal language models are scaled up, the number of parameters per data point remains constant; however, both the loss variance and the quantization error decrease, implying that larger models should have smaller generalization gaps. We examine why larger models tend to be more quantizable from an information theoretic perspective, showing that the rate at which they can integrate new information grows more slowly than their capacity on the compute-optimal frontier. From these findings we produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.
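The abstract's key tool is a Freedman-type martingale concentration inequality. As a rough illustration of the general shape of such a bound (the classical Freedman tail bound, not the paper's exact empirical variant), the sketch below simulates bounded martingale differences and checks that their sum stays below sqrt(2V·log(1/δ)) + (2b/3)·log(1/δ); all function names here are ours, not the paper's:

```python
import math
import random

def freedman_bound(variance_sum, increment_bound, delta):
    """High-probability bound on a martingale sum (classical Freedman form).

    With probability at least 1 - delta, a sum of martingale differences
    bounded by b, with predictable quadratic variation at most V, stays
    below sqrt(2 * V * log(1/delta)) + (2b/3) * log(1/delta).
    """
    log_term = math.log(1.0 / delta)
    return (math.sqrt(2.0 * variance_sum * log_term)
            + (2.0 * increment_bound / 3.0) * log_term)

def simulate_violation_rate(trials=1000, n_steps=200, b=1.0,
                            delta=0.05, seed=0):
    """Fraction of simulated i.i.d.-increment martingale paths whose
    running sum exceeds the Freedman bound at the final step."""
    rng = random.Random(seed)
    # Per-step conditional variance of Uniform[-b, b] is b^2 / 3.
    v = n_steps * b * b / 3.0
    violations = 0
    for _ in range(trials):
        s = sum(rng.uniform(-b, b) for _ in range(n_steps))
        if s > freedman_bound(v, b, delta):
            violations += 1
    return violations / trials
```

In practice the empirical violation rate comes out far below δ, since the bound is conservative; the paper's contribution is a variance-aware empirical version that tightens exactly this kind of slack.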
Problem

Research questions and friction points this paper is trying to address.

Investigates why larger LLMs generalize better
Develops compute-optimal generalization bounds for LLMs
Analyzes scaling laws for generalization gap reduction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Empirical Freedman-type martingale inequality tightens bounds
Generalization bound decomposes into three interpretable components
Scaling law predicts stronger generalization with scale
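The abstract notes that on the compute-optimal frontier the number of parameters per token stays constant, so the shrinking gap must come from the other two terms. A minimal sketch of that bookkeeping, assuming the widely cited ~20 tokens-per-parameter Chinchilla heuristic (the helper name and the exact coefficient are our assumptions, not the paper's):

```python
# Why the parameters-per-token term is flat along the compute-optimal path:
# the Chinchilla analysis suggests roughly 20 training tokens per parameter,
# so the ratio N / D is scale-invariant by construction.

def chinchilla_tokens(n_params, tokens_per_param=20.0):
    """Approximate compute-optimal training-token count for a model size."""
    return tokens_per_param * n_params

model_sizes = [4e8, 7e9, 7e10]  # 400M, 7B, 70B parameters
ratios = [n / chinchilla_tokens(n) for n in model_sizes]
# Every ratio is 1/20 = 0.05 regardless of scale, so any reduction in the
# generalization gap must come from the loss-variance and quantization terms.
```

Under this heuristic, scaling compute changes model size and data in lockstep, which is what lets the paper attribute the improving gap to decreasing loss variance and quantization error.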