🤖 AI Summary
This work investigates the scaling efficiency of bootstrapped pretraining—i.e., further pretraining an already well-pretrained language model—addressing the critical question of whether marginal gains diminish as the base model’s pretraining extent increases.
Method: Through large-scale empirical analysis and scaling-law modeling, the authors systematically examine how data volume, training steps, and performance gains interact across multi-stage pretraining.
Contribution/Results: They find that the scaling exponent of second-stage pretraining decreases logarithmically with the number of tokens used to pretrain the base model, exhibiting strong saturation. Building on this, they derive a simple, empirically grounded scaling law that captures the joint dependence of performance on first- and second-stage token counts. This law enables quantitative prediction of bootstrapping gains for base models with arbitrary pretraining budgets, providing both theoretical insight and practical guidance for the efficient reuse of existing foundation models and the allocation of training resources.
📝 Abstract
Bootstrapped pretraining, i.e., the reuse of a pretrained base model for further pretraining, such as continual pretraining or model growth, is a promising way to reduce the cost of training language models from scratch. However, its effectiveness remains unclear, especially when applied to overtrained base models. In this work, we empirically study the scaling behavior of bootstrapped pretraining and find that its scaling efficiency diminishes in a predictable manner: the scaling exponent with respect to second-stage pretraining tokens decreases logarithmically with the number of tokens used to pretrain the base model. The joint dependence on first- and second-stage tokens is accurately modeled by a simple scaling law. This saturation effect reveals a fundamental trade-off in multi-stage pretraining strategies: the more extensively a model is pretrained, the less additional benefit bootstrapping provides. Our findings provide practical insights for efficient language model training and raise important considerations for the reuse of overtrained models.
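The abstract does not state the paper's fitted law, but its qualitative behavior can be sketched with a toy model: second-stage loss as a power law in second-stage tokens, with an exponent that shrinks logarithmically as first-stage tokens grow. All constants (`beta0`, `k`, `d_ref`, `l_base`) and the exact parameterization below are illustrative assumptions, not the paper's results.

```python
import math

def second_stage_exponent(d1_tokens, beta0=0.1, k=0.01, d_ref=1e9):
    """Hypothetical second-stage scaling exponent.

    Decreases logarithmically with first-stage token count d1_tokens,
    mirroring the saturation effect the abstract describes. The
    constants are illustrative placeholders, not fitted values.
    """
    return max(beta0 - k * math.log(d1_tokens / d_ref), 0.0)

def bootstrapped_loss(d1_tokens, d2_tokens, l_base=3.0):
    """Toy loss after second-stage pretraining on d2_tokens.

    Modeled as a power law in d2_tokens whose exponent shrinks with
    d1_tokens: a heavily pretrained base model gains less from the
    same second-stage budget.
    """
    beta = second_stage_exponent(d1_tokens)
    return l_base * (d2_tokens / 1e9) ** (-beta)

# Same second-stage budget (10B tokens), two base models:
# the more overtrained base (1e12 first-stage tokens) improves less.
loss_light_base = bootstrapped_loss(1e9, 1e10)
loss_heavy_base = bootstrapped_loss(1e12, 1e10)
```

Under this toy parameterization, `loss_heavy_base > loss_light_base`, i.e., the same second-stage token budget buys a smaller loss reduction when the base model was pretrained longer, which is the trade-off the paper quantifies.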