🤖 AI Summary
Downstream performance of pretrained language models often exhibits weak correlation with pretraining cross-entropy loss, and the underlying mechanism remains unclear.
Method: We introduce a theoretical framework built on *coverage*, defined as the probability mass the model assigns to high-quality responses, and establish coverage as a necessary and sufficient condition for post-training and test-time scaling methods such as Best-of-N to succeed. Through theoretical analysis and algorithmic design, we show that coverage generalizes faster than cross-entropy and predicts downstream performance more accurately; reveal that next-token prediction implicitly optimizes coverage; and propose provably effective interventions to enhance coverage, including gradient normalization, model/checkpoint selection, and test-time decoding strategies.
Results: Extensive experiments demonstrate coverage’s strong predictive power for post-training efficacy across diverse tasks, thereby establishing a rigorous theoretical bridge linking pretraining objectives, coverage, and downstream performance.
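Coverage, as defined above, lends itself to a direct Monte Carlo estimate: sample responses from the model and measure how often they land in the high-quality set. A minimal sketch on a toy categorical model (the response names, probabilities, and "high-quality" labels below are illustrative, not from the paper):

```python
import random

def coverage(model_probs, good_responses):
    """Exact coverage: probability mass the model places on high-quality responses."""
    return sum(p for r, p in model_probs.items() if r in good_responses)

def mc_coverage(model_probs, good_responses, n_samples=10_000, seed=0):
    """Monte Carlo estimate of coverage: sample responses from the model
    and count how often they fall in the high-quality set."""
    rng = random.Random(seed)
    responses = list(model_probs)
    weights = [model_probs[r] for r in responses]
    hits = sum(
        rng.choices(responses, weights)[0] in good_responses
        for _ in range(n_samples)
    )
    return hits / n_samples

# Toy model over four candidate responses; two count as high-quality.
probs = {"proof_ok": 0.15, "proof_alt": 0.05, "wrong_1": 0.5, "wrong_2": 0.3}
good = {"proof_ok", "proof_alt"}

print(coverage(probs, good))      # exact mass on good responses (0.2)
print(mc_coverage(probs, good))   # sampled estimate, close to 0.2
```

For a real language model the exact sum is intractable, so the sampled estimate is the practical route; the verifier deciding "high-quality" plays the role of the `good` set here.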
📝 Abstract
Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model remains poorly understood. Notably, although pre-training success is often quantified by cross-entropy loss, cross-entropy can be a poor predictor of downstream performance. Instead, we provide a theoretical perspective on this relationship through the lens of *coverage*, which quantifies the probability mass the pre-trained model places on high-quality responses and which is necessary and sufficient for post-training and test-time scaling methods such as Best-of-N to succeed. Our main results develop an understanding of *the coverage principle*, a phenomenon whereby next-token prediction implicitly optimizes toward a model with good coverage. In particular, we uncover a mechanism that explains the power of coverage in predicting downstream performance: *coverage generalizes faster than cross-entropy*, avoiding spurious dependence on problem-dependent parameters such as the sequence length. We also study practical algorithmic interventions with provable benefits for improving coverage, including (i) model/checkpoint selection procedures, (ii) gradient normalization schemes, and (iii) test-time decoding strategies.
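To see why coverage is exactly the quantity that Best-of-N depends on: under a perfect verifier, Best-of-N succeeds whenever at least one of the N independent samples is high-quality, so its success probability is 1 - (1 - c)^N, where c is the model's coverage. A hedged sketch (the `sample` and `reward` callables are hypothetical stand-ins for a model and a verifier, not the paper's code):

```python
import random

def best_of_n(sample, reward, n, seed=0):
    """Best-of-N test-time scaling: draw n candidates, keep the highest-reward one."""
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=reward)

def bon_success_prob(c, n):
    """With coverage c and a perfect verifier, Best-of-N fails only if
    all n independent samples miss the high-quality set."""
    return 1 - (1 - c) ** n

# Toy model: emits a correct response with probability 0.2, i.e. coverage c = 0.2.
sample = lambda rng: "right" if rng.random() < 0.2 else "wrong"
reward = lambda r: 1.0 if r == "right" else 0.0

print(best_of_n(sample, reward, n=16))
print(bon_success_prob(0.2, 16))  # 1 - 0.8**16 ≈ 0.972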