Relative Scaling Laws for LLMs

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional scaling laws report aggregate performance gains, obscuring how performance evolves heterogeneously across subpopulations. Method: We propose "relative scaling laws" to systematically track how performance gaps between test distributions—spanning academic domains, regional English varieties, and AI risk categories—evolve as language models scale up. Using 255 decoder-only Transformers trained under IsoFLOP constraints and evaluated on a multidimensional benchmark suite, we measure how these disparities change with compute. Contribution/Results: We find (i) subject-wise MMLU performance converges toward parity; (ii) proficiency on regional English varieties shifts in line with speaker-population size; (iii) capability- and influence-related AI risks exhibit widening disparities, whereas adversarial risks show no significant increase. Critically, we show that scaling does not universally narrow performance gaps: some disparities intensify with model size. All model checkpoints are publicly released.

📝 Abstract
Scaling laws describe how language models improve with additional data, parameters, and compute. While widely used, they are typically measured on aggregate test sets. Aggregate evaluations yield clean trends but average over heterogeneous subpopulations, obscuring performance disparities. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale rather than focusing solely on absolute error. Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP) budgets from 10^18 to 10^20 FLOPs on standard pretraining datasets, we find diverse trajectories: academic domains on MMLU converge toward parity; regional English dialects shift depending on population size; and clusters of AI risk behaviours split, with capability- and influence-related risks increasing during pretraining while adversarial risks do not. These results show that although scaling improves overall performance, it is not a universal equalizer. To support further study, we release all model checkpoints from this work to enable practitioners to measure relative alongside traditional scaling laws, in order to better prioritize robustness challenges in light of the bitter lesson.
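The core idea lends itself to a small illustration. Below is a minimal sketch, not the authors' released code: it fits a per-distribution power law loss(C) = a · C^(-b) and tracks the ratio of predicted losses between two test distributions as compute grows. The domain names and loss values are hypothetical placeholders, and whether the "gap" is a ratio or a difference is a design choice; the paper's exact functional form may differ.

```python
import numpy as np

# Hypothetical IsoFLOP sweep: compute budgets (FLOPs) and per-domain eval losses.
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
losses = {
    "domain_A": np.array([3.10, 2.85, 2.60, 2.41, 2.25]),
    "domain_B": np.array([3.60, 3.38, 3.15, 2.98, 2.83]),
}

# Fit loss(C) = a * C**(-b) per domain in log-log space:
# log loss = log a - b * log C.
fits = {}
for domain, y in losses.items():
    slope, intercept = np.polyfit(np.log(compute), np.log(y), 1)
    fits[domain] = (np.exp(intercept), -slope)  # (a, b)

def predicted_loss(domain, c):
    a, b = fits[domain]
    return a * c ** (-b)

# Relative scaling law: track the gap between two distributions as compute grows.
# A ratio drifting away from 1.0 means the disparity changes with scale.
for c in compute:
    gap = predicted_loss("domain_B", c) / predicted_loss("domain_A", c)
    print(f"compute={c:.0e}  relative gap={gap:.3f}")
```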
Problem

Research questions and friction points this paper is trying to address.

Analyzing how performance disparities across test distributions evolve with scale
Tracking capability gaps between domains as models grow
Investigating how scaling affects different AI risk categories and specialization patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Relative scaling laws track how performance gaps evolve with scale
255 IsoFLOP-matched decoder-only Transformers trained across a 10^18–10^20 FLOP range (see the sketch after this list)
Released checkpoints let practitioners measure relative alongside traditional scaling laws
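As a companion to the IsoFLOP bullet above, here is a minimal sketch of how matched-compute configurations can be enumerated, using the common approximation C ≈ 6ND (compute ≈ 6 × parameters × tokens) for decoder-only Transformers. The budget and model sizes are assumptions for illustration, not the paper's exact recipe.

```python
# Minimal sketch (assumed, not the paper's recipe): enumerate
# parameter/token pairs on one IsoFLOP contour via the common
# approximation C ≈ 6 * N * D for decoder-only Transformers.
BUDGET_FLOPS = 1e19  # one matched-compute contour; hypothetical value

for n_params in [1e8, 3e8, 1e9, 3e9]:  # hypothetical model sizes
    n_tokens = BUDGET_FLOPS / (6 * n_params)
    print(f"params={n_params:.0e}  tokens={n_tokens:.1e}  "
          f"tokens/param={n_tokens / n_params:.1f}")
```

Sweeping model size at a fixed budget like this is what makes the resulting models comparable: any gap between test distributions reflects scale, not unequal training compute.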