InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing scaling laws struggle to predict the performance of large language models under varying data mixtures and repeated training, which hinders efficient data recipe selection. This work proposes InfoLaw, a framework that models pretraining as an information accumulation process and unifies data quality weighting, training repetition, and model scale in a single formulation with scale-dependent diminishing returns. On unseen data recipes and larger-scale runs (up to 7B parameters and 425B tokens), InfoLaw predicts loss with a mean absolute error of 0.15% (maximum 0.96%), and it extrapolates reliably across overtraining levels, which makes data recipe selection more efficient.

📝 Abstract

Upweighting high-quality data in LLM pretraining often improves performance, but in data-limited regimes, especially under overtraining, stronger upweighting increases repetition and can degrade performance. However, standard scaling laws do not reliably extrapolate across mixture recipes or under repetition, leaving the selection of optimal data recipes at scale underdetermined. To solve this, we introduce InfoLaw (Information Scaling Laws), a data-aware scaling framework that predicts loss from consumed tokens, model size, data mixture weights, and repetition. The key idea is to model pretraining as information accumulation, where quality controls information density and repetition induces scale-dependent diminishing returns. We first collect model performance after training on datasets that vary in scale, quality distribution, and repetition level. We then construct the information model so that accumulated information accurately predicts this performance. InfoLaw predicts performance on unseen data recipes and larger-scale runs (up to 7B parameters, 425B tokens) with 0.15% mean and 0.96% max absolute error in loss, and it extrapolates reliably across overtraining levels, enabling efficient data-recipe selection under varying compute budgets.
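To make the upweighting-versus-repetition trade-off in the abstract concrete, the sketch below computes how many times each data source would be traversed under a fixed token budget. It is plain bookkeeping, not the InfoLaw formulation, which is not given here. The function name `effective_epochs`, the three quality buckets, their sizes, and the mixture weights are all hypothetical; only the 425B-token budget comes from the abstract.

```python
def effective_epochs(weights, source_tokens, budget_tokens):
    """Return how many passes each source receives under a mixture.

    weights       : sampling weight per source (normalized inside)
    source_tokens : unique tokens available per source
    budget_tokens : total tokens consumed during training
    """
    total = sum(weights.values())
    return {
        name: (weights[name] / total) * budget_tokens / source_tokens[name]
        for name in weights
    }


# Hypothetical pool sizes (unique tokens); not taken from the paper.
source_tokens = {
    "high_quality": 50e9,
    "medium_quality": 200e9,
    "low_quality": 400e9,
}
budget = 425e9  # token budget quoted in the abstract

# Two hypothetical recipes: mild vs. strong upweighting of high-quality data.
mild = effective_epochs(
    {"high_quality": 0.2, "medium_quality": 0.4, "low_quality": 0.4},
    source_tokens,
    budget,
)
strong = effective_epochs(
    {"high_quality": 0.6, "medium_quality": 0.2, "low_quality": 0.2},
    source_tokens,
    budget,
)

for label, recipe in (("mild", mild), ("strong", strong)):
    passes = ", ".join(f"{k}: {v:.2f}" for k, v in recipe.items())
    print(f"{label:>6} upweighting -> passes per source: {passes}")
```

With these assumed numbers, tripling the weight of the high-quality bucket triples the number of passes over it, which is the regime where the abstract says repetition can start to degrade performance.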

Problem

Research questions and friction points this paper is trying to address.

scaling laws

data mixture

repetition

pretraining

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Information Scaling Laws

data mixture

repetition