Scaling Laws for Mixture Pretraining Under Data Constraints

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work studies how to pretrain language models when target-domain data are scarce, by mixing the limited target data with abundant generic data. Across more than 2,000 language-model training runs, it examines how the repetition count and mixing ratio of the target data affect target-domain performance. It finds that mixture training tolerates much more repetition than single-source training: scarce target corpora can be reused 15-20 times, with the optimal number of repetitions depending on target data size, compute budget, and model scale. Building on these findings, the authors propose a repetition-aware mixture scaling law that accounts for the diminishing value of repeated target tokens and the regularizing effect of generic data. The study covers multilingual, domain-specific, and quality-filtered mixtures, and optimizing the scaling law yields practical mixture recommendations for data-constrained pretraining.

📝 Abstract

As language models scale, the amount of data they require grows -- yet many target data sources, such as low-resource languages or specialized domains, are inherently limited in size. A common strategy is to mix this scarce but valuable target data with abundant generic data, which presents a fundamental trade-off: too little target data in the mixture underexposes the model to the target domain, while too much target data repeats the same examples excessively, yielding diminishing returns and eventual overfitting. We study this trade-off across more than 2,000 language-model training runs spanning multiple model and target dataset sizes, as well as several data types, including multilingual, domain-specific, and quality-filtered mixtures. Across all settings, we find that repetition is a central driver of target-domain performance, and that mixture training tolerates much higher repetition than single-source training: scarce target corpora can be reused 15-20 times, with the optimal number of repetitions depending on the target data size, compute budget, and model scale. Next, we introduce a repetition-aware mixture scaling law that accounts for the decreasing value of repeated target tokens and the regularizing role of generic data. Optimizing the scaling law provides a principled way to compute effective mixture configurations, yielding practical mixture recommendations for pretraining under data constraints.
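The trade-off described in the abstract can be made concrete with simple bookkeeping: for a fixed training budget, the target share of the mixture determines how many times the scarce corpus is repeated. The sketch below is not the paper's scaling law. It assumes a fixed target fraction throughout training and uniform sampling from the target corpus, and the function names and token counts are hypothetical, chosen only for illustration.

```python
def target_epochs(target_fraction: float, total_tokens: float, target_size: float) -> float:
    """Passes over the target corpus implied by a mixture configuration.

    Assumption: a constant share `target_fraction` of the `total_tokens`
    training budget is drawn uniformly from a target corpus of
    `target_size` tokens.
    """
    if not 0.0 <= target_fraction <= 1.0:
        raise ValueError("target_fraction must lie in [0, 1]")
    if total_tokens <= 0 or target_size <= 0:
        raise ValueError("token counts must be positive")
    return target_fraction * total_tokens / target_size


def fraction_for_epochs(epochs: float, total_tokens: float, target_size: float) -> float:
    """Target share of the mixture needed to repeat the target corpus `epochs` times."""
    if epochs < 0 or total_tokens <= 0 or target_size <= 0:
        raise ValueError("epochs must be non-negative and token counts positive")
    fraction = epochs * target_size / total_tokens
    if fraction > 1.0:
        raise ValueError("requested repetitions exceed the training budget")
    return fraction


if __name__ == "__main__":
    # Hypothetical setting: 100B-token budget, 0.5B-token target corpus.
    budget, corpus = 100e9, 0.5e9
    for share in (0.01, 0.05, 0.10):
        print(f"target share {share:.0%}: {target_epochs(share, budget, corpus):.1f} epochs")
    for reps in (15, 20):
        print(f"{reps} epochs: target share {fraction_for_epochs(reps, budget, corpus):.1%}")
```

In this hypothetical setting, a 1% target share shows the model each target example only twice, while a 10% share repeats it 20 times, which is the upper end of the 15-20 range the abstract reports as tolerable in mixture training.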

Problem

Research questions and friction points this paper is trying to address.

data constraints

mixture pretraining

repetition

scaling laws

target-domain performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

mixture pretraining

scaling laws

data repetition