Distilled Pretraining: A modern lens of Data, In-Context Learning and Test-Time Scaling

📅 2025-09-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how knowledge-distillation pretraining differentially affects test-time scaling and in-context learning (ICL) in modern large language models (LLMs). Through systematic experiments on real-world LLMs and an analytically tractable bigram sandbox, combined with induction-head analysis and quantitative measurement of ICL capability, the paper establishes that distillation consistently improves test-time scaling while degrading induction-head-based ICL. The trade-off arises because distillation compresses the model's implicit modeling of contextual dependencies, favoring deterministic reasoning pathways over flexible, context-sensitive inference. These findings offer theoretical insight and practical guidance for pretraining design: optimizing for test-time performance may inadvertently compromise ICL, so explicit mitigation strategies are needed to avoid ICL degradation.
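The distillation objective underlying these experiments can be sketched as follows: rather than training on hard next-token labels alone, the student also matches the teacher's soft next-token distribution. A minimal pure-Python sketch, with all names and the blending scheme hypothetical (not taken from the paper's code):

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label, alpha=0.5):
    """Blend cross-entropy on the hard label with KL divergence to the
    teacher's soft distribution, the standard distilled-pretraining recipe."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    ce_hard = -math.log(p_student[hard_label])
    kl_soft = sum(t * math.log(t / s)
                  for t, s in zip(p_teacher, p_student) if t > 0)
    return alpha * ce_hard + (1 - alpha) * kl_soft
```

When the student already matches the teacher, the KL term vanishes and only the hard-label cross-entropy remains.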

📝 Abstract
In the past year, distillation has seen a renewed prominence in large language model (LLM) pretraining, exemplified by the Llama-3.2 and Gemma model families. While distillation has historically been shown to improve statistical modeling, its effects on new paradigms that are key to modern LLMs, such as test-time scaling and in-context learning, remain underexplored. In this work, we make three main contributions. First, we show that pretraining with distillation yields models that exhibit remarkably better test-time scaling. Second, we observe that this benefit comes with a trade-off: distillation impairs in-context learning capabilities, particularly the one modeled via induction heads. Third, to demystify these findings, we study distilled pretraining in a sandbox of a bigram model, which helps us isolate the common principal factor behind our observations. Finally, using these insights, we shed light on various design choices for pretraining that should help practitioners going forward.
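The "sandbox of a bigram model" mentioned in the abstract can be illustrated with a toy setup: a teacher defined by ground-truth bigram transition probabilities, and a student that either learns from hard one-hot samples (standard pretraining) or receives the teacher's full next-token distribution (distillation). A hedged sketch assuming a tiny vocabulary; the details are hypothetical and not the paper's exact construction:

```python
import random
from collections import defaultdict

random.seed(0)

VOCAB = ["a", "b", "c"]
# Ground-truth bigram transition probabilities (the "teacher").
TEACHER = {
    "a": {"a": 0.1, "b": 0.8, "c": 0.1},
    "b": {"a": 0.3, "b": 0.1, "c": 0.6},
    "c": {"a": 0.7, "b": 0.2, "c": 0.1},
}

def sample_next(prev):
    """Draw one hard next-token sample from the teacher."""
    r, acc = random.random(), 0.0
    for tok, p in TEACHER[prev].items():
        acc += p
        if r < acc:
            return tok
    return VOCAB[-1]

def student_from_samples(n):
    """Standard pretraining: estimate bigram probabilities from n one-hot
    samples, which is noisy at finite n."""
    counts = defaultdict(lambda: defaultdict(int))
    for _ in range(n):
        prev = random.choice(VOCAB)
        counts[prev][sample_next(prev)] += 1
    return {p: {t: c / sum(cs.values()) for t, c in cs.items()}
            for p, cs in counts.items()}

# Distilled pretraining: the student regresses on the teacher's full
# distribution, so in this toy setting it recovers TEACHER exactly.
student_distilled = {p: dict(d) for p, d in TEACHER.items()}
```

The contrast between the sampled student and the distilled student is what makes the sandbox analytically convenient: distillation hands the student the exact conditional distributions, at the cost of never exposing it to the sampling variability that context-sensitive mechanisms would otherwise have to absorb.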
Problem

Research questions and friction points this paper is trying to address.

Investigating distillation's impact on test-time scaling
Exploring trade-offs between distillation and in-context learning
Identifying principal factors affecting distilled pretraining performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distilled pretraining improves test-time scaling
Distillation impairs in-context learning capabilities
Sandbox bigram model isolates the common principal factor
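"Test-time scaling" in the claims above refers to spending extra compute at inference, for example by sampling several candidate answers and selecting among them. A minimal best-of-N sketch, where the generator and scorer are hypothetical stand-ins for an LLM and a verifier:

```python
import random

random.seed(1)

def generate(prompt):
    """Stand-in for a stochastic LLM call: returns one candidate answer."""
    return prompt + "-" + str(random.randint(0, 9))

def score(answer):
    """Stand-in verifier/reward model: higher is better."""
    return int(answer.rsplit("-", 1)[1])

def best_of_n(prompt, n):
    """Best-of-N test-time scaling: sample n candidates, keep the best."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

A model whose samples concentrate probability on high-scoring answers benefits more from each additional draw, which is one way a distilled student's sharper output distribution can translate into better test-time scaling.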