Data-efficient pre-training by scaling synthetic megadocs

πŸ“… 2026-03-19
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses an efficiency bottleneck in existing synthetic-data methods for data-constrained pre-training, where rewriting documents into many short rephrases limits performance. To overcome this, the authors propose a β€œmegadoc” construction strategy that concatenates multiple rewritten versions of the same document and interleaves them with reasoning content (rationales) to form ultra-long synthetic documents. This improves both data efficiency and long-context modeling. By combining synthetic rephrasing, rationale insertion, and an optimized mixing-and-epoching schedule, the method raises data efficiency from 1.48Γ— to 1.80Γ— at 32 synthetic generations per document. The resulting models show clear gains in i.i.d. validation loss, downstream task accuracy, and long-context performance.

πŸ“ Abstract
Synthetic data augmentation has emerged as a promising solution when pre-training is constrained by data rather than compute. We study how to design synthetic data algorithms that achieve better loss scaling: not only lowering loss at finite compute but especially as compute approaches infinity. We first show that pre-training on web data mixed with synthetically generated rephrases improves i.i.d. validation loss on the web data, despite the synthetic data coming from an entirely different distribution. With optimal mixing and epoching, loss and benchmark accuracy improve without overfitting as the number of synthetic generations grows, plateauing near $1.48\times$ data efficiency at 32 rephrases per document. We find even better loss scaling under a new perspective: synthetic generations from the same document can form a single substantially longer megadocument instead of many short documents. We show two ways to construct megadocs: stitching synthetic rephrases from the same web document or stretching a document by inserting rationales. Both methods improve i.i.d. loss, downstream benchmarks, and especially long-context loss relative to simple rephrasing, increasing data efficiency from $1.48\times$ to $1.80\times$ at $32$ generations per document. Importantly, the improvement of megadocs over simple rephrasing widens as more synthetic data is generated. Our results show how to design synthetic data algorithms that benefit more from increasing compute when data-constrained.
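The abstract names two megadoc constructions: stitching multiple synthetic rephrases of the same web document into one long document, and stretching a document by inserting rationales between its parts. A minimal sketch of the two constructions, with illustrative function names and separators that are assumptions, not taken from the paper's code:

```python
def stitch_megadoc(rephrases, separator="\n\n"):
    """Concatenate synthetic rephrases of one web document into a single
    long 'megadoc' rather than treating each rephrase as its own short
    training document."""
    return separator.join(rephrases)


def stretch_megadoc(paragraphs, rationales, separator="\n\n"):
    """Stretch one document by interleaving a generated rationale after
    each paragraph, lengthening the training sequence."""
    parts = []
    for paragraph, rationale in zip(paragraphs, rationales):
        parts.append(paragraph)
        parts.append(rationale)
    return separator.join(parts)


# Example: two rephrases of the same source document become one megadoc.
rephrases = ["Rephrase A of the document.", "Rephrase B of the document."]
megadoc = stitch_megadoc(rephrases)
print(megadoc.count("\n\n") + 1)  # prints 2: two segments, one document
```

Either way, generations that would otherwise be many short documents become one long sequence, which is what the paper credits for the improved long-context loss.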
Problem

Research questions and friction points this paper is trying to address.

synthetic data
data efficiency
pre-training
loss scaling
megadocs
Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic megadocs
data-efficient pre-training
synthetic data augmentation
loss scaling
long-context modeling