🤖 AI Summary
This work addresses the scarcity of high-quality web data for large language model (LLM) pretraining by investigating whether synthetically generated data exhibits predictable scaling behavior. We propose SynthLLM, a framework that leverages graph-based algorithms to automatically extract high-level concepts across documents and recombine them into diverse, high-fidelity synthetic corpora. Crucially, we provide the first empirical validation that synthetic data obeys a rectified scaling law: downstream performance saturates near 300B tokens, and larger models reach peak performance with fewer tokens, challenging the conventional assumption that gains depend strictly on data volume. An 8B-parameter model attains optimal performance with just 1T synthetic tokens, significantly outperforming existing approaches. This study establishes a synthetic-data paradigm for LLM pretraining that is both theoretically grounded, via interpretable scaling laws, and empirically scalable, enabling sustained LLM advancement amid diminishing real-data resources.
📝 Abstract
Large language models (LLMs) achieve strong performance across diverse tasks, largely driven by high-quality web data used in pre-training. However, recent studies indicate this data source is rapidly depleting. Synthetic data emerges as a promising alternative, but it remains unclear whether synthetic datasets exhibit predictable scalability comparable to raw pre-training data. In this work, we systematically investigate the scaling laws of synthetic data by introducing SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm. Key findings from our extensive mathematical experiments on SynthLLM include: (1) SynthLLM generates synthetic data that reliably adheres to the *rectified scaling law* across various model sizes; (2) performance improvements plateau near 300B tokens; and (3) larger models approach optimal performance with fewer training tokens. For instance, an 8B model peaks at 1T tokens, while a 3B model requires 4T. Moreover, comparisons with existing synthetic data generation and augmentation methods demonstrate that SynthLLM achieves superior performance and scalability. Our findings highlight synthetic data as a scalable and reliable alternative to organic pre-training corpora, offering a viable path toward continued improvement in model performance.
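To make the rectified-scaling-law claim concrete, here is a minimal, self-contained sketch of how one might fit such a law to (token count, loss) measurements. It assumes the common parameterization L(D) = B + A / (D_l + D)^α, where B is the irreducible loss and D_l shifts the curve to account for pre-learned knowledge; the paper's exact functional form, data, and fitted constants may differ, and the data points below are purely illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def rectified_scaling_law(D, A, B, D_l, alpha):
    """Loss as a function of data size D (tokens, in billions).

    B is the irreducible loss floor; D_l shifts the curve so that
    loss stays finite as D -> 0 (the 'rectified' modification).
    """
    return B + A / (D_l + D) ** alpha

# Illustrative (not real) measurements: token counts in billions vs. loss,
# generated from known parameters plus a little noise.
D = np.geomspace(5, 2000, 12)
rng = np.random.default_rng(0)
L = rectified_scaling_law(D, A=5.0, B=1.8, D_l=20.0, alpha=0.4)
L = L + rng.normal(0.0, 0.005, size=D.shape)

# Fit the four parameters; bounds keep D_l and alpha positive so the
# power term stays well-defined during optimization.
params, _ = curve_fit(
    rectified_scaling_law, D, L,
    p0=[5.0, 1.5, 10.0, 0.5],
    bounds=([0.0, 0.0, 0.0, 0.0], [np.inf, np.inf, np.inf, 2.0]),
    maxfev=20000,
)
A_hat, B_hat, Dl_hat, alpha_hat = params
print(f"fitted loss floor B ~ {B_hat:.2f}, exponent alpha ~ {alpha_hat:.2f}")
```

Under this form, the observed plateau falls out naturally: once D is large relative to D_l, the power-law term shrinks toward zero and the loss approaches the floor B, so additional synthetic tokens yield diminishing returns.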