A Survey on LLM Mid-training

📅 2025-10-27
🤖 AI Summary
The "mid-training" phase, occurring between pretraining and post-training, has been largely overlooked in large language model (LLM) development, despite its critical role in balancing targeted capability enhancement with preservation of foundational language modeling performance. Method: The survey formally defines and categorizes mid-training, proposing a multi-stage optimization framework that encompasses data curation, curriculum learning, continued pretraining, instruction tuning, and architectural expansion. Contribution/Results: Analysis of mainstream model implementations illustrates how objective-driven mid-training interventions systematically improve target capabilities, including mathematical reasoning, code generation, complex reasoning, and long-context understanding, while robustly maintaining general language modeling performance. The work establishes a formal taxonomy and practical guidelines for mid-training, supporting reproducible, controllable, and efficient LLM capability evolution and domain-specific customization.

📝 Abstract
Recent advances in foundation models have highlighted the significant benefits of multi-stage training, with a particular emphasis on the emergence of mid-training as a vital stage that bridges pre-training and post-training. Mid-training is distinguished by its use of intermediate data and computational resources, systematically enhancing specified capabilities such as mathematics, coding, reasoning, and long-context extension, while maintaining foundational competencies. This survey provides a formal definition of mid-training for large language models (LLMs) and investigates optimization frameworks that encompass data curation, training strategies, and model architecture optimization. We analyze mainstream model implementations in the context of objective-driven interventions, illustrating how mid-training serves as a distinct and critical stage in the progressive development of LLM capabilities. By clarifying the unique contributions of mid-training, this survey offers a comprehensive taxonomy and actionable insights, supporting future research and innovation in the advancement of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Defining mid-training as a distinct, critical stage of LLM development
Building optimization frameworks that span data curation, training strategies, and model architecture
Enhancing targeted capabilities such as reasoning while preserving foundational skills
Innovation

Methods, ideas, or system contributions that make the work stand out.

Positions mid-training as the bridge between pre-training and post-training
Uses intermediate data and compute to enhance specific LLM capabilities
Jointly optimizes data curation, training strategies, and model architecture
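The survey frames mid-training as continued pretraining in which the data mixture shifts from general text toward target capabilities (math, code, long-context) under an intermediate compute budget. As a minimal sketch of that idea, the hypothetical helper below (`mixture_at` is not from the paper) interpolates per-domain sampling weights between a pretraining-like mixture and a capability-focused one over a fixed step budget:

```python
def mixture_at(step, total_steps, start_mix, end_mix):
    """Linearly interpolate per-domain sampling weights between two mixtures.

    A toy stand-in for the curriculum/data-curation strategies the survey
    categorizes; real schedules may be staged or non-linear.
    """
    t = min(max(step / total_steps, 0.0), 1.0)
    mix = {d: (1 - t) * start_mix[d] + t * end_mix[d] for d in start_mix}
    z = sum(mix.values())
    return {d: w / z for d, w in mix.items()}  # renormalize to a distribution

# Illustrative mixtures (assumed, not taken from the paper):
start = {"web": 0.8, "math": 0.1, "code": 0.1}   # pretraining-like mixture
end   = {"web": 0.4, "math": 0.3, "code": 0.3}   # capability-focused mixture

mid = mixture_at(500, 1000, start, end)           # mixture halfway through
```

Halfway through the schedule, the sampler would draw 60% web, 20% math, and 20% code, gradually emphasizing target domains without abruptly abandoning the general distribution that underpins foundational competence.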