Scaling Laws for Mixture Pretraining Under Data Constraints

📅 2026-05-12
🤖 AI Summary
This work addresses how to pretrain language models effectively when target-domain data are scarce, by mixing the limited target data with abundant generic data. Across more than 2,000 training runs, the study systematically examines how the repetition count and mixing ratio of target data affect target-domain performance. It finds that mixture training tolerates much more repetition than single-source training: scarce target corpora can be reused 15-20 times, with the optimal count depending on target data size, compute budget, and model scale. Building on these findings, the authors propose a repetition-aware mixture scaling law that accounts for the diminishing value of repeated target tokens and the regularizing effect of generic data. The study spans multilingual, domain-specific, and quality-filtered mixtures, and optimizing the scaling law yields practical mixture recommendations for data-constrained pretraining.
📝 Abstract
As language models scale, the amount of data they require grows -- yet many target data sources, such as low-resource languages or specialized domains, are inherently limited in size. A common strategy is to mix this scarce but valuable target data with abundant generic data, which presents a fundamental trade-off: too little target data in the mixture underexposes the model to the target domain, while too much target data repeats the same examples excessively, yielding diminishing returns and eventual overfitting. We study this trade-off across more than 2,000 language-model training runs spanning multiple model and target dataset sizes, as well as several data types, including multilingual, domain-specific, and quality-filtered mixtures. Across all settings, we find that repetition is a central driver of target-domain performance, and that mixture training tolerates much higher repetition than single-source training: scarce target corpora can be reused 15-20 times, with the optimal number of repetitions depending on the target data size, compute budget, and model scale. Next, we introduce a repetition-aware mixture scaling law that accounts for the decreasing value of repeated target tokens and the regularizing role of generic data. Optimizing the scaling law provides a principled way to compute effective mixture configurations, yielding practical mixture recommendations for pretraining under data constraints.
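The trade-off in the abstract comes down to simple arithmetic linking the mixing ratio to the repetition count: for a fixed training budget, a larger share of target data means more passes over the same scarce corpus. The sketch below shows that bookkeeping only; it is not the paper's scaling law, whose functional form is not given here. The function names and the token counts in the usage lines are hypothetical, chosen for illustration.

```python
def target_epochs(mix_fraction: float, total_tokens: float, target_tokens: float) -> float:
    """Passes over the target corpus implied by a mixture.

    mix_fraction  -- share of training tokens drawn from the target corpus (0..1)
    total_tokens  -- total training budget in tokens
    target_tokens -- number of unique target tokens available
    """
    if not 0.0 <= mix_fraction <= 1.0:
        raise ValueError("mix_fraction must lie in [0, 1]")
    if target_tokens <= 0 or total_tokens < 0:
        raise ValueError("token counts must be positive")
    return mix_fraction * total_tokens / target_tokens


def mix_fraction_for_epochs(epochs: float, total_tokens: float, target_tokens: float) -> float:
    """Mixing ratio needed to reach a given number of target passes.

    The result is capped at 1.0: a budget cannot be more than 100% target data.
    """
    if epochs < 0 or total_tokens <= 0 or target_tokens <= 0:
        raise ValueError("epochs must be non-negative and token counts positive")
    return min(epochs * target_tokens / total_tokens, 1.0)


# Hypothetical setting: 1B unique target tokens, 100B-token training budget.
print(target_epochs(0.20, 100e9, 1e9))           # target repetitions at a 20% target share
print(mix_fraction_for_epochs(15, 100e9, 1e9))   # target share needed for 15 repetitions
```

In this hypothetical setting, a target share between 15% and 20% corresponds to the 15-20 repetitions the abstract reports as tolerable. The abstract also notes that the best repetition count depends on target data size, compute budget, and model scale, so these numbers are not a recommendation.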
Problem

Research questions and friction points this paper is trying to address.

data constraints
mixture pretraining
repetition
scaling laws
target-domain performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

mixture pretraining
scaling laws
data repetition
data-constrained training
target-domain adaptation