InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition

📅 2026-05-04
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing scaling laws struggle to predict the performance of large language models under varying data mixtures and repeated training, which hinders efficient data recipe selection. This work proposes InfoLaw, a framework that models pretraining as an information accumulation process and, for the first time, unifies data quality weighting, training repetition, and model scale in a single formulation with scale-dependent diminishing returns. In experiments on runs of up to 7B parameters and 425B tokens, InfoLaw predicts loss for unseen data recipes with a mean absolute error of 0.15% (maximum 0.96%), improving the reliability of performance extrapolation and the efficiency of data selection across diverse recipes and overtraining scenarios.
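As a reading aid for the reported error figures, here is a minimal sketch of how a mean and maximum absolute error in loss could be computed. It assumes the percentages are errors relative to the observed loss, which the summary does not state explicitly; the function names and all loss values are hypothetical and are not taken from the paper.

```python
def relative_abs_errors_pct(predicted, observed):
    """Per-run absolute error, as a percentage of the observed loss (assumed definition)."""
    return [abs(p - o) / o * 100.0 for p, o in zip(predicted, observed)]


def summarize_errors(predicted, observed):
    """Return (mean, max) of the per-run percentage errors."""
    errors = relative_abs_errors_pct(predicted, observed)
    return sum(errors) / len(errors), max(errors)


# Hypothetical held-out recipes: observed final losses and a law's predictions.
observed_loss = [2.00, 2.50, 3.00]
predicted_loss = [2.01, 2.50, 2.97]

mean_err, max_err = summarize_errors(predicted_loss, observed_loss)
print(f"mean abs error: {mean_err:.2f}%  max abs error: {max_err:.2f}%")
```

Reporting the maximum alongside the mean shows whether any single recipe is predicted badly, which matters when the prediction is used to pick one recipe.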
📝 Abstract
Upweighting high-quality data in LLM pretraining often improves performance, but in data-limited regimes, especially under overtraining, stronger upweighting increases repetition and can degrade performance. However, standard scaling laws do not reliably extrapolate across mixture recipes or under repetition, leaving the selection of optimal data recipes at scale underdetermined. To address this, we introduce InfoLaw (Information Scaling Laws), a data-aware scaling framework that predicts loss from consumed tokens, model size, data mixture weights, and repetition. The key idea is to model pretraining as information accumulation, where quality controls information density and repetition induces scale-dependent diminishing returns. We first collect model performance after training on datasets that vary in scale, quality distribution, and repetition level. We then construct a model of information such that the modeled information accurately predicts the measured performance. InfoLaw predicts performance on unseen data recipes and larger-scale runs (up to 7B parameters, 425B tokens) with 0.15% mean and 0.96% max absolute error in loss, and it extrapolates reliably across overtraining levels, enabling efficient data-recipe selection under varying compute budgets.
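The link between upweighting and repetition is simple arithmetic, sketched below. This is not InfoLaw's formulation: it only counts how many passes each source receives when a fixed token budget is sampled according to mixture weights. The source names, sizes, weights, and budget are hypothetical.

```python
def epochs_per_source(weights, source_tokens, budget_tokens):
    """Passes over each source when budget_tokens are drawn according to weights."""
    total_weight = sum(weights.values())
    return {
        name: (weights[name] / total_weight) * budget_tokens / source_tokens[name]
        for name in weights
    }


# Hypothetical pool: a small high-quality source and two larger, lower-quality ones.
source_tokens = {"high_quality": 50e9, "medium_quality": 200e9, "low_quality": 400e9}
budget_tokens = 400e9  # total tokens consumed during training

mild = epochs_per_source(
    {"high_quality": 0.25, "medium_quality": 0.25, "low_quality": 0.50},
    source_tokens,
    budget_tokens,
)
strong = epochs_per_source(
    {"high_quality": 0.50, "medium_quality": 0.25, "low_quality": 0.25},
    source_tokens,
    budget_tokens,
)

for name in source_tokens:
    print(f"{name}: {mild[name]:.2f} epochs (mild) vs {strong[name]:.2f} epochs (strong)")
```

In this toy setup, doubling the weight on the high-quality source doubles the number of passes over it while the lower-quality sources are seen less than once. That is the trade-off the abstract describes: denser data is repeated more often, and each repeat is assumed to contribute less.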
Problem

Research questions and friction points this paper is trying to address.

scaling laws
data mixture
repetition
pretraining
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Information Scaling Laws
data mixture
repetition
quality-weighted data
loss prediction