🤖 AI Summary
High-quality pretraining of large language models (LLMs) is constrained by the scarcity of natural, high-quality text, raising the question of whether synthetic data can substitute for or augment natural data. Method: We conduct a large-scale ablation study under a unified experimental protocol, training over 1,000 models with more than 100,000 GPU-hours, and systematically evaluate diverse synthetic data types (e.g., rephrased text, generated textbooks) and strategies for mixing them with natural corpora. Contribution/Results: We find that mixing roughly 30% rephrased synthetic data with natural text accelerates convergence by 5–10× at larger data budgets; that the benefit of synthetic data depends on model size and data budget; and that susceptibility to "model collapse" varies markedly across synthetic data types. We also propose a practical, model-size- and data-budget-aware heuristic for choosing the synthetic-to-natural data ratio. We find no advantage to purely synthetic pretraining, but demonstrate that judicious hybrid training achieves both computational efficiency and training stability, providing an evidence-based, scalable recipe for LLM pretraining.
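The paper's actual heuristic is not reproduced here, but the sketch below illustrates, under stated assumptions, what a data-budget-aware ratio rule could look like in Python. Only the ~30% asymptote for rephrased data comes from the reported results; the threshold `small_budget_tokens` and the ramp shape are hypothetical placeholders.

```python
# Hypothetical sketch of a budget-aware synthetic-data ratio rule.
# The paper reports that "good" ratios depend on model size and data
# budget, converging empirically to ~30% for rephrased synthetic data;
# the threshold and ramp below are illustrative assumptions, not the
# authors' published heuristic.

def synthetic_fraction(data_budget_tokens: float,
                       small_budget_tokens: float = 10e9) -> float:
    """Return the fraction of rephrased synthetic data to mix in.

    At small budgets synthetic data shows little benefit per the
    paper's findings, so favor natural text; at larger budgets,
    ramp toward the reported ~30% mixture.
    """
    if data_budget_tokens <= small_budget_tokens:
        return 0.0
    # Smoothly approach 0.30 as the budget grows past the threshold.
    excess = data_budget_tokens / small_budget_tokens
    return min(0.30, 0.30 * (1.0 - 1.0 / excess))
```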
📝 Abstract
Training data plays a crucial role in Large Language Model (LLM) scaling, yet high-quality data is in limited supply. Synthetic data techniques offer a potential path toward sidestepping this limitation. We conduct a large-scale empirical investigation (>1,000 LLMs trained with >100,000 GPU-hours) using a unified protocol and scaling laws, comparing natural web data, diverse synthetic types (rephrased text, generated textbooks), and mixtures of natural and synthetic data. Specifically, we find that pre-training on rephrased synthetic data *alone* is not faster than pre-training on natural web text, whereas mixing 1/3 rephrased synthetic data with 2/3 natural web text can speed up pre-training by 5-10x (measured by steps to reach the same validation loss) at larger data budgets. Pre-training on textbook-style synthetic data *alone* results in notably higher loss on many downstream domains, especially at small data budgets. "Good" ratios of synthetic data in training mixtures depend on the model size and data budget, empirically converging to ~30% for rephrased synthetic data. Larger generator models do not necessarily yield better pre-training data than ~8B-parameter models. These results contribute mixed evidence on "model collapse" during large-scale single-round (n=1) training on synthetic data: training on rephrased synthetic data shows no performance degradation at foreseeable scales, whereas training on mixtures of textbook-style, purely generated synthetic data shows the patterns predicted by "model collapse". Our work demystifies synthetic data in pre-training, validates its conditional benefits, and offers practical guidance.
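As a concrete illustration of the 2/3-natural, 1/3-rephrased mixture described above, here is a minimal Python sketch of a mixed pretraining document stream. The names `natural_docs` and `synthetic_docs` are assumed iterables of text documents introduced for this example, not identifiers from the paper.

```python
import random

# Minimal sketch of mixing natural and rephrased-synthetic pretraining
# streams at the 2/3 : 1/3 ratio the abstract reports as beneficial at
# larger data budgets. Sampling per document is one simple strategy;
# the paper's exact data pipeline is not specified here.

def mixed_stream(natural_docs, synthetic_docs, synthetic_ratio=1 / 3, seed=0):
    """Yield documents, drawing synthetic ones with probability `synthetic_ratio`."""
    rng = random.Random(seed)
    natural, synthetic = iter(natural_docs), iter(synthetic_docs)
    while True:
        source = synthetic if rng.random() < synthetic_ratio else natural
        try:
            yield next(source)
        except StopIteration:
            return  # stop once either stream is exhausted

# Usage example with toy corpora:
for doc in mixed_stream(["nat1", "nat2", "nat3"], ["syn1", "syn2"]):
    print(doc)
```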