Synthetic Bootstrapped Pretraining

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Standard language model pretraining captures only intra-document, token-level causal dependencies, leaving learnable semantic relationships across documents unexploited. To address this, Synthetic Bootstrapped Pretraining (SBP) first learns a model of relations between documents in the pretraining dataset, then uses it to synthesize a large new corpus for joint training. The synthesized documents are not mere paraphrases: SBP abstracts a core concept from the seed material and crafts a new narration on top of it, and the procedure admits a natural Bayesian interpretation. In a compute-matched setup, a 3B-parameter model pretrained from scratch on up to 1T tokens with SBP consistently outperforms a strong repetition-based baseline and recovers a significant fraction of the improvement attainable by an oracle with access to 20x more unique data. These results indicate that explicitly modeling inter-document relationships yields substantial gains in pretraining efficacy.

📝 Abstract
We introduce Synthetic Bootstrapped Pretraining (SBP), a language model (LM) pretraining procedure that first learns a model of relations between documents from the pretraining dataset and then leverages it to synthesize a vast new corpus for joint training. While standard pretraining teaches LMs to learn causal correlations among tokens within a single document, it is not designed to efficiently model the rich, learnable inter-document correlations that can potentially lead to better performance. We validate SBP by designing a compute-matched pretraining setup and pretraining a 3B-parameter model on up to 1T tokens from scratch. We find SBP consistently improves upon a strong repetition baseline and delivers a significant fraction of the performance improvement attainable by an oracle upper bound with access to 20x more unique data. Qualitative analysis reveals that the synthesized documents go beyond mere paraphrases: SBP first abstracts a core concept from the seed material and then crafts a new narration on top of it. Besides strong empirical performance, SBP admits a natural Bayesian interpretation: the synthesizer implicitly learns to abstract the latent concepts shared between related documents.
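The procedure in the abstract has three moving parts: mining related document pairs, training a synthesizer on those pairs, and joint training on real plus synthetic data. A minimal sketch follows, with loud assumptions: bag-of-words cosine similarity stands in for the paper's learned relevance model, the shared-word "synthesizer" is a toy stand-in for an LM sampling from p(d2 | d1), and all names (`pair_documents`, `synthesize`) are hypothetical, not from the paper.

```python
# Illustrative sketch of an SBP-style data pipeline (assumptions noted above).
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pair_documents(corpus: list[str], threshold: float = 0.3) -> list[tuple[str, str]]:
    """Step 1: mine related (seed, target) pairs from the pretraining corpus."""
    bags = [Counter(doc.lower().split()) for doc in corpus]
    return [
        (corpus[i], corpus[j])
        for i in range(len(corpus))
        for j in range(len(corpus))
        if i != j and cosine(bags[i], bags[j]) >= threshold
    ]


def synthesize(pairs: list[tuple[str, str]]) -> list[str]:
    """Step 2 (toy): emit the shared-concept words of each pair as a new
    'document'; a real synthesizer would be an LM trained on the pairs."""
    return [
        " ".join(sorted(set(d1.lower().split()) & set(d2.lower().split())))
        for d1, d2 in pairs
    ]


# Step 3: join synthetic documents with the real corpus for pretraining.
corpus = [
    "the cat sat on the mat",
    "a cat sat near a mat",
    "stock prices rose sharply today",
]
pairs = pair_documents(corpus)
training_corpus = corpus + synthesize(pairs)
```

In the toy run above, only the two cat/mat documents pair up (both directions), so the joined corpus grows from three to five documents; the synthesized entries contain just the concept words the pair shares.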
Problem

Research questions and friction points this paper is trying to address.

Standard pretraining captures only intra-document, token-level causal correlations
Rich, learnable inter-document correlations go unexploited by standard pretraining
How to synthesize a new training corpus from learned document relations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesizes a vast new corpus from a learned model of document relations
Captures inter-document correlations beyond intra-document, token-level modeling
Admits a Bayesian interpretation: the synthesizer abstracts latent concepts shared between related documents
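The Bayesian reading of the synthesizer can be sketched in notation of our own (not taken from the paper): the conditional the synthesizer learns implicitly marginalizes over a latent concept $c$ shared by related documents,

```latex
p(d_2 \mid d_1) \;=\; \int p(d_2 \mid c)\, p(c \mid d_1)\, \mathrm{d}c ,
```

so mapping a seed document $d_1$ to a related document $d_2$ pressures the model to infer $p(c \mid d_1)$, i.e. to abstract the shared concept, and then to generate a new narration from it.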