🤖 AI Summary
Current language model pretraining is limited to intra-document, token-level causal modeling and neglects learnable semantic relationships across documents. To address this, the authors propose Synthetic Bootstrapped Pretraining (SBP): first learn a model of relations between documents in the pretraining corpus, then use it to synthesize a vast new corpus of abstracted documents (not mere paraphrases) that enable cross-document conceptual generalization; the procedure also admits a natural Bayesian interpretation. In a compute-matched, from-scratch setup, a 3B-parameter model is pretrained on up to 1T tokens, with synthesis proceeding by abstracting a core concept from seed material and crafting a new narration on top of it. SBP consistently outperforms a strong repetition-based baseline under identical compute and captures a significant fraction of the improvement attainable by an oracle with access to 20× more unique data, offering empirical evidence that explicitly modeling inter-document relationships improves pretraining efficiency.
📝 Abstract
We introduce Synthetic Bootstrapped Pretraining (SBP), a language model (LM) pretraining procedure that first learns a model of relations between documents from the pretraining dataset and then leverages it to synthesize a vast new corpus for joint training. While standard pretraining teaches LMs to learn causal correlations among tokens within a single document, it is not designed to efficiently model the rich, learnable inter-document correlations that can potentially lead to better performance. We validate SBP by designing a compute-matched pretraining setup and pretraining a 3B-parameter model on up to 1T tokens from scratch. We find SBP consistently improves upon a strong repetition baseline and delivers a significant fraction of the performance improvement attainable by an oracle upper bound with access to 20× more unique data. Qualitative analysis reveals that the synthesized documents go beyond mere paraphrases: SBP first abstracts a core concept from the seed material and then crafts a new narration on top of it. Besides strong empirical performance, SBP admits a natural Bayesian interpretation: the synthesizer implicitly learns to abstract the latent concepts shared between related documents.
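The abstract's two-stage recipe (mine related-document pairs, then train a synthesizer on them) can be sketched as a toy data pipeline. Everything below is illustrative, not the paper's implementation: the pairing uses a simple bag-of-words cosine similarity with a made-up threshold, and the actual synthesizer would be an LM fine-tuned on the resulting (seed, target) pairs.

```python
# Hypothetical sketch of the SBP data pipeline. The similarity metric,
# threshold, and example format are assumptions for illustration only.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def related_pairs(docs, threshold=0.3):
    """Step 1: mine ordered cross-document pairs (d1, d2) whose surface
    overlap suggests a shared latent concept."""
    bows = [Counter(d.lower().split()) for d in docs]
    return [(docs[i], docs[j])
            for i in range(len(docs)) for j in range(len(docs))
            if i != j and cosine(bows[i], bows[j]) >= threshold]

def build_synthesizer_examples(pairs):
    """Step 2: format pairs as training examples for a synthesizer LM that
    models p(d2 | d1); sampling from it later yields new documents."""
    return [{"prompt": d1, "completion": d2} for d1, d2 in pairs]

docs = [
    "gradient descent updates parameters along the negative gradient",
    "stochastic gradient descent updates parameters using minibatch gradients",
    "transformers process tokens with self attention layers",
]
pairs = related_pairs(docs)          # only the two optimization docs pair up
examples = build_synthesizer_examples(pairs)
# Step 3 (not shown): sample synthetic documents from the tuned synthesizer
# and train jointly on real + synthetic tokens under a fixed compute budget.
```

The key design point the abstract emphasizes is that the synthesizer is trained on document pairs rather than single documents, so generation abstracts the shared concept instead of paraphrasing any one seed.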