🤖 AI Summary
This work addresses the challenge of improving large language models' performance on complex reasoning tasks, including mathematics, science, coding, algorithms, planning, and spatial reasoning. We propose a three-stage methodology: (1) filtering for high-quality, "teachable" reasoning data for supervised fine-tuning (SFT); (2) lightweight, outcome-based reinforcement learning to extend reasoning chains and refine final answers at controlled computational cost; and (3) multi-stage reasoning chain distillation coupled with structured prompt engineering. Using this approach, we develop Phi-4-reasoning (14B) and its enhanced variant Phi-4-reasoning-plus, which outperform the 70B DeepSeek-R1-Distill-Llama on multiple domain-specific reasoning benchmarks and approach the performance of the full DeepSeek-R1, while also improving general-purpose capabilities. Our study provides empirical evidence that careful data curation is critical for reasoning SFT, shows a non-trivial transfer of reasoning proficiency to general tasks, and points to limitations in current reasoning evaluation protocols.
📝 Abstract
We introduce Phi-4-reasoning, a 14-billion-parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on a carefully curated set of "teachable" prompts (selected for the right level of complexity and diversity) and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generating longer reasoning traces. Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B and approach the performance of the full DeepSeek-R1 model. Our comprehensive evaluations span benchmarks in math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we also observe a non-trivial transfer of improvements to general-purpose benchmarks. In this report, we provide insights into our training data, our training methodologies, and our evaluations. We show that the benefit of careful data curation for supervised fine-tuning (SFT) extends to reasoning language models and can be further amplified by reinforcement learning (RL). Finally, our evaluation points to opportunities for improving how we assess the performance and robustness of reasoning models.
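
To make "outcome-based reinforcement learning" concrete, the following is a minimal sketch of an outcome-based reward, in which only the final answer is scored and the reasoning chain is not graded. It is not the reward used to train Phi-4-reasoning-plus: the `\boxed{...}` answer format, the function names `extract_final_answer` and `outcome_reward`, and the +1/-1 reward values are all assumptions chosen for illustration.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    # Assumption: the final answer is wrapped in \boxed{...}.
    # If several boxed spans appear, the last one is treated as final.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def outcome_reward(response: str, reference: str) -> float:
    # Outcome-based: the reward depends only on the final answer,
    # not on the length or content of the reasoning that precedes it.
    answer = extract_final_answer(response)
    if answer is None:
        return -1.0  # hypothetical penalty for a missing final answer
    return 1.0 if answer == reference.strip() else -1.0


print(outcome_reward("Adding the terms gives \\boxed{42}", "42"))
```

A practical reward would likely need answer normalization (for example, treating equivalent numeric or symbolic forms as equal), which this exact-string comparison does not attempt.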