🤖 AI Summary
This survey examines post-training of large language models (LLMs), focusing on three recurring challenges: catastrophic forgetting, reward hacking, and inference-time trade-offs between reasoning depth and efficiency. It organizes the field into a unified view spanning model alignment, scalable adaptation, and inference-time optimization, and reviews complementary techniques including supervised fine-tuning, reinforcement learning from human feedback (RLHF, typically with PPO), test-time scaling, chain-of-thought distillation, and dynamic inference control. These methods aim to improve reasoning capability, factual accuracy, and alignment with user intent beyond what pretraining alone provides. The authors also release *Awesome-LLM-Post-training*, an open-source repository that systematizes key challenges, technical approaches, and emerging research directions, serving as a continually updated reference for both academia and industry, bridging theoretical insight with practical deployment requirements in LLM alignment and adaptation.
📝 Abstract
Large Language Models (LLMs) have transformed the natural language processing landscape and enabled diverse applications. Pretraining on vast web-scale data has laid the foundation for these models, yet the research community is now increasingly shifting focus toward post-training techniques to achieve further breakthroughs. While pretraining provides a broad linguistic foundation, post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations. Fine-tuning, reinforcement learning, and test-time scaling have emerged as critical strategies for optimizing LLM performance, ensuring robustness, and improving adaptability across various real-world tasks. This survey provides a systematic exploration of post-training methodologies, analyzing their role in refining LLMs beyond pretraining and addressing key challenges such as catastrophic forgetting, reward hacking, and inference-time trade-offs. We highlight emerging directions in model alignment, scalable adaptation, and inference-time reasoning, and outline future research directions. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/mbzuai-oryx/Awesome-LLM-Post-training.
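To make the inference-time side of this taxonomy concrete, here is a minimal sketch of one common test-time scaling strategy, best-of-N sampling: draw several candidate answers and keep the one a scorer prefers. The `generate` and `reward` functions below are stand-ins of our own (a real setup would call an LLM and a reward model or verifier), not code from the surveyed works.

```python
import random

def generate(prompt, rng):
    """Stand-in for sampling one candidate answer from an LLM.
    A real implementation would call a model with temperature > 0."""
    return f"{prompt} -> answer#{rng.randint(0, 9)}"

def reward(answer):
    """Stand-in scorer, e.g. a reward model or a task-specific verifier.
    Here it just reads the number baked into the fake answer."""
    return int(answer.rsplit("#", 1)[-1])

def best_of_n(prompt, n=8, seed=0):
    """Best-of-N test-time scaling: sample n candidates independently
    and return the highest-scoring one. More compute (larger n) buys
    better expected quality without retraining the model."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=reward)

if __name__ == "__main__":
    print(best_of_n("What is 2+2?", n=8))
```

The key design point is that quality scales with inference compute (`n`) rather than with additional training, which is why such methods are grouped under inference-time optimization.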