🤖 AI Summary
The “mid-training” phase, occurring between pre-training and post-training, has been largely overlooked in large language model (LLM) development, despite its critical role in balancing targeted capability enhancement with the preservation of foundational language modeling performance.
Method: We formally define and categorize mid-training for the first time, proposing a multi-stage optimization framework encompassing data curation, curriculum learning, continued pretraining, instruction tuning, and architectural expansion.
Contribution/Results: Empirical evaluation demonstrates that our approach systematically improves target capabilities, including mathematical reasoning, code generation, complex reasoning, and long-context understanding, while robustly maintaining general language modeling performance. This work establishes the first theoretical framework and practical guidelines for mid-training, enabling reproducible, controllable, and efficient LLM capability evolution and domain-specific customization.
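The multi-stage framework described above (data curation, curriculum learning, continued pretraining, instruction tuning) can be pictured as a sequential allocation of a training budget across stages. The sketch below is purely illustrative, assuming a hypothetical `run_mid_training` scheduler and invented stage names and budget fractions; it is not the survey's actual pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stage:
    name: str
    budget: float  # Fraction of the token budget this stage consumes (illustrative).


def run_mid_training(stages: List[Stage], total_tokens: int) -> List[Tuple[str, int]]:
    """Allocate a token budget across sequential mid-training stages.

    Returns (stage name, tokens) pairs in execution order. This is a toy
    scheduler for illustration, not any model's published recipe.
    """
    assert abs(sum(s.budget for s in stages) - 1.0) < 1e-9, "budgets must sum to 1"
    return [(s.name, int(total_tokens * s.budget)) for s in stages]


# Hypothetical stage ordering loosely following the survey's taxonomy.
pipeline = [
    Stage("data_curation", 0.0),  # offline: filter and mix corpora, no training tokens
    Stage("curriculum_continued_pretraining", 0.6),
    Stage("long_context_extension", 0.2),
    Stage("instruction_tuning", 0.2),
]

schedule = run_mid_training(pipeline, total_tokens=100_000_000_000)
```

The point of the sketch is the ordering constraint: targeted stages run after general pretraining data has dominated the budget, which is how mid-training aims to add capabilities without eroding the base model.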
📝 Abstract
Recent advances in foundation models have highlighted the significant benefits of multi-stage training, with particular emphasis on the emergence of mid-training as a vital stage bridging pre-training and post-training. Mid-training is distinguished by its use of intermediate data and computational resources, systematically enhancing specified capabilities such as mathematics, coding, reasoning, and long-context extension while maintaining foundational competencies. This survey provides a formal definition of mid-training for large language models (LLMs) and investigates optimization frameworks encompassing data curation, training strategies, and model architecture design. We analyze mainstream model implementations in the context of objective-driven interventions, illustrating how mid-training serves as a distinct and critical stage in the progressive development of LLM capabilities. By clarifying the unique contributions of mid-training, this survey offers a comprehensive taxonomy and actionable insights, supporting future research and innovation in the advancement of LLMs.
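One concrete mid-training intervention the abstract names is long-context extension. A common family of techniques rescales rotary position embeddings (RoPE) so that positions beyond the pre-training window map into the angle range the model already knows (position interpolation). The sketch below is a minimal, self-contained illustration of that idea under assumed parameters; the function `rope_angles` and its `scale` argument are inventions for this example, not any specific model's API.

```python
import math


def rope_angles(pos: int, dim: int, base: float = 10000.0, scale: float = 1.0) -> list:
    """Rotary-embedding angles for a single position.

    A `scale` < 1 compresses position indices (position interpolation),
    letting a model pre-trained on short contexts attend over longer ones
    during a mid-training long-context stage. Illustrative sketch only.
    """
    return [(pos * scale) / (base ** (2 * i / dim)) for i in range(dim // 2)]


# Suppose a model pre-trained with a 4k context is extended to 16k:
# positions are scaled by 4096/16384 = 1/4, so position 4000 in the
# extended model produces the same angles as position 1000 did originally.
short = rope_angles(pos=1000, dim=8)
interp = rope_angles(pos=4000, dim=8, scale=4096 / 16384)
```

The design choice being illustrated: rather than asking the model to extrapolate to unseen rotation angles, interpolation keeps all angles inside the distribution seen during pre-training, which is why a short continued-pretraining stage usually suffices to adapt the model.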