🤖 AI Summary
This work addresses the scarcity of high-quality web data for large language model (LLM) pretraining by investigating whether synthetically generated data exhibits predictable scaling behavior. We propose SynthLLM, a framework that leverages graph-based algorithms to automatically extract high-level concepts across documents and recombine them into diverse, high-fidelity synthetic corpora. Crucially, we provide the first empirical validation that synthetic data obeys a rectified scaling law: downstream performance saturates near 300B tokens, and larger models reach peak performance with fewer tokens, challenging the conventional assumption that gains depend strictly on data volume. An 8B-parameter model attains optimal performance with just 1T synthetic tokens, significantly outperforming existing approaches. This study establishes a synthetic-data paradigm for LLM pretraining that is both theoretically grounded, via interpretable scaling laws, and empirically scalable, enabling sustained LLM advancement amid diminishing real-data resources.
📝 Abstract
Large language models (LLMs) achieve strong performance across diverse tasks, largely driven by high-quality web data used in pre-training. However, recent studies indicate this data source is rapidly depleting. Synthetic data emerges as a promising alternative, but it remains unclear whether synthetic datasets exhibit predictable scalability comparable to raw pre-training data. In this work, we systematically investigate the scaling laws of synthetic data by introducing SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm. Key findings from our extensive mathematical experiments on SynthLLM include: (1) SynthLLM generates synthetic data that reliably adheres to the *rectified scaling law* across various model sizes; (2) performance improvements plateau near 300B tokens; and (3) larger models approach optimal performance with fewer training tokens. For instance, an 8B model peaks at 1T tokens, while a 3B model requires 4T. Moreover, comparisons with existing synthetic data generation and augmentation methods demonstrate that SynthLLM achieves superior performance and scalability. Our findings highlight synthetic data as a scalable and reliable alternative to organic pre-training corpora, offering a viable path toward continued improvement in model performance.
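To make the rectified-scaling-law claim concrete, here is a minimal, self-contained sketch of how one might fit such a law to (token count, loss) measurements. It assumes the common parameterization L(D) = B + A / (D_l + D)^α, where B is the irreducible loss and D_l shifts the curve to account for pre-learned knowledge; the paper's exact functional form, data, and fitted constants may differ, and the data points below are purely illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def rectified_scaling_law(D, A, B, D_l, alpha):
    """Loss as a function of data size D (tokens, in billions).

    B is the irreducible loss floor; D_l shifts the curve so that
    loss stays finite as D -> 0 (the 'rectified' modification).
    """
    return B + A / (D_l + D) ** alpha

# Illustrative (not real) measurements: token counts in billions vs. loss,
# generated from known parameters plus a little noise.
D = np.geomspace(5, 2000, 12)
rng = np.random.default_rng(0)
L = rectified_scaling_law(D, A=5.0, B=1.8, D_l=20.0, alpha=0.4)
L = L + rng.normal(0.0, 0.005, size=D.shape)

# Fit the four parameters; bounds keep D_l and alpha positive so the
# power term stays well-defined during optimization.
params, _ = curve_fit(
    rectified_scaling_law, D, L,
    p0=[5.0, 1.5, 10.0, 0.5],
    bounds=([0.0, 0.0, 0.0, 0.0], [np.inf, np.inf, np.inf, 2.0]),
    maxfev=20000,
)
A_hat, B_hat, Dl_hat, alpha_hat = params
print(f"fitted loss floor B ~ {B_hat:.2f}, exponent alpha ~ {alpha_hat:.2f}")
```

Under this form, the observed plateau falls out naturally: once D is large relative to D_l, the power-law term shrinks toward zero and the loss approaches the floor B, so additional synthetic tokens yield diminishing returns.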