🤖 AI Summary
Current language model pretraining is limited to intra-document, token-level causal modeling and neglects learnable semantic relationships across documents. To address this, the authors propose Synthetic Bootstrapped Pretraining (SBP): first learn a model of relations between documents in the pretraining corpus, then use it to synthesize a vast new corpus of abstracted documents (not mere paraphrases) that enable cross-document conceptual generalization; the procedure also admits a natural Bayesian interpretation. In a compute-matched, from-scratch setup, a 3B-parameter model is pretrained on up to 1T tokens, with synthesis proceeding by abstracting a core concept from seed material and crafting a new narration on top of it. SBP consistently outperforms a strong repetition-based baseline under identical compute and captures a significant fraction of the improvement attainable by an oracle with access to 20× more unique data, offering empirical evidence that explicitly modeling inter-document relationships improves pretraining efficiency.
📝 Abstract
We introduce Synthetic Bootstrapped Pretraining (SBP), a language model (LM) pretraining procedure that first learns a model of relations between documents from the pretraining dataset and then leverages it to synthesize a vast new corpus for joint training. While standard pretraining teaches LMs to learn causal correlations among tokens within a single document, it is not designed to efficiently model the rich, learnable inter-document correlations that can potentially lead to better performance. We validate SBP by designing a compute-matched pretraining setup and pretraining a 3B-parameter model on up to 1T tokens from scratch. We find SBP consistently improves upon a strong repetition baseline and delivers a significant fraction of the performance improvement attainable by an oracle upper bound with access to 20× more unique data. Qualitative analysis reveals that the synthesized documents go beyond mere paraphrases: SBP first abstracts a core concept from the seed material and then crafts a new narration on top of it. Besides strong empirical performance, SBP admits a natural Bayesian interpretation: the synthesizer implicitly learns to abstract the latent concepts shared between related documents.
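The abstract's two-stage recipe (mine related-document pairs, then train a synthesizer on them) can be sketched as a toy data pipeline. Everything below is illustrative, not the paper's implementation: the pairing uses a simple bag-of-words cosine similarity with a made-up threshold, and the actual synthesizer would be an LM fine-tuned on the resulting (seed, target) pairs.

```python
# Hypothetical sketch of the SBP data pipeline. The similarity metric,
# threshold, and example format are assumptions for illustration only.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def related_pairs(docs, threshold=0.3):
    """Step 1: mine ordered cross-document pairs (d1, d2) whose surface
    overlap suggests a shared latent concept."""
    bows = [Counter(d.lower().split()) for d in docs]
    return [(docs[i], docs[j])
            for i in range(len(docs)) for j in range(len(docs))
            if i != j and cosine(bows[i], bows[j]) >= threshold]

def build_synthesizer_examples(pairs):
    """Step 2: format pairs as training examples for a synthesizer LM that
    models p(d2 | d1); sampling from it later yields new documents."""
    return [{"prompt": d1, "completion": d2} for d1, d2 in pairs]

docs = [
    "gradient descent updates parameters along the negative gradient",
    "stochastic gradient descent updates parameters using minibatch gradients",
    "transformers process tokens with self attention layers",
]
pairs = related_pairs(docs)          # only the two optimization docs pair up
examples = build_synthesizer_examples(pairs)
# Step 3 (not shown): sample synthetic documents from the tuned synthesizer
# and train jointly on real + synthetic tokens under a fixed compute budget.
```

The key design point the abstract emphasizes is that the synthesizer is trained on document pairs rather than single documents, so generation abstracts the shared concept instead of paraphrasing any one seed.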