🤖 AI Summary
This survey examines post-training of large language models (LLMs), focusing on three recurring challenges: catastrophic forgetting, reward hacking, and inference-time trade-offs between reasoning depth and efficiency. It organizes the field into a unified view spanning model alignment, scalable adaptation, and inference-time optimization, and reviews complementary techniques including supervised fine-tuning, reinforcement learning from human feedback (RLHF, typically with PPO), test-time scaling, chain-of-thought distillation, and dynamic inference control. These methods aim to improve reasoning capability, factual accuracy, and alignment with user intent beyond what pretraining alone provides. The authors also release *Awesome-LLM-Post-training*, an open-source repository that systematizes key challenges, technical approaches, and emerging research directions, serving as a continually updated reference for both academia and industry, bridging theoretical insight with practical deployment requirements in LLM alignment and adaptation.
📝 Abstract
Large Language Models (LLMs) have transformed the natural language processing landscape and enabled diverse applications. Pretraining on vast web-scale data has laid the foundation for these models, yet the research community is now increasingly shifting focus toward post-training techniques to achieve further breakthroughs. While pretraining provides a broad linguistic foundation, post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations. Fine-tuning, reinforcement learning, and test-time scaling have emerged as critical strategies for optimizing LLM performance, ensuring robustness, and improving adaptability across various real-world tasks. This survey provides a systematic exploration of post-training methodologies, analyzing their role in refining LLMs beyond pretraining and addressing key challenges such as catastrophic forgetting, reward hacking, and inference-time trade-offs. We highlight emerging directions in model alignment, scalable adaptation, and inference-time reasoning, and outline future research directions. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/mbzuai-oryx/Awesome-LLM-Post-training.
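To make the inference-time side of this taxonomy concrete, here is a minimal sketch of one common test-time scaling strategy, best-of-N sampling: draw several candidate answers and keep the one a scorer prefers. The `generate` and `reward` functions below are stand-ins of our own (a real setup would call an LLM and a reward model or verifier), not code from the surveyed works.

```python
import random

def generate(prompt, rng):
    """Stand-in for sampling one candidate answer from an LLM.
    A real implementation would call a model with temperature > 0."""
    return f"{prompt} -> answer#{rng.randint(0, 9)}"

def reward(answer):
    """Stand-in scorer, e.g. a reward model or a task-specific verifier.
    Here it just reads the number baked into the fake answer."""
    return int(answer.rsplit("#", 1)[-1])

def best_of_n(prompt, n=8, seed=0):
    """Best-of-N test-time scaling: sample n candidates independently
    and return the highest-scoring one. More compute (larger n) buys
    better expected quality without retraining the model."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=reward)

if __name__ == "__main__":
    print(best_of_n("What is 2+2?", n=8))
```

The key design point is that quality scales with inference compute (`n`) rather than with additional training, which is why such methods are grouped under inference-time optimization.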