🤖 AI Summary
This work addresses the high computational cost and instability of training large reasoning models with reinforcement learning (RL), in particular when multi-objective rewards are used to balance efficiency against accuracy. The authors propose a simplified on-policy supervised fine-tuning (SFT) approach that drops complex RL components such as KL regularization and group normalization, reframing efficiency optimization as a supervised learning task. The method filters self-generated data by two criteria, correctness and conciseness, and uses a truncation-based length penalty, which improves training stability and efficiency. Evaluated across five benchmarks, the approach outperforms existing RL methods: it maintains the original accuracy while reducing reasoning chain length by up to 80%, cutting GPU memory usage by 50%, and accelerating convergence by 70%.
📝 Abstract
Large reasoning models (LRMs) are commonly trained with reinforcement learning (RL) to explore long chain-of-thought (CoT) reasoning, achieving strong performance at high computational cost. Recent methods add multi-reward objectives to jointly optimize correctness and brevity, but these complex extensions often destabilize training and yield suboptimal trade-offs. We revisit this objective and challenge the necessity of such complexity. Through principled analysis, we identify fundamental misalignments in this paradigm: KL regularization loses its intended role when correctness and length are directly verifiable, and group-wise normalization becomes ambiguous under multiple reward signals. By removing these two components and simplifying the reward to a truncation-based length penalty, we show that the optimization problem reduces to supervised fine-tuning on self-generated data filtered for both correctness and conciseness. We term this simplified training strategy on-policy SFT. Despite its simplicity, on-policy SFT consistently defines the accuracy-efficiency Pareto frontier. It reduces CoT length by up to 80% while maintaining original accuracy, surpassing more complex RL-based methods across five benchmarks. Furthermore, it significantly enhances training efficiency, reducing GPU memory usage by 50% and accelerating convergence by 70%. Our code is available at https://github.com/EIT-NLP/On-Policy-SFT.
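The filtering step described in the abstract can be illustrated with a minimal sketch. It is not taken from the authors' repository: the `Sample` type, the `is_correct` checker, the `max_tokens` budget, and the binary form of the reward are all assumptions made for illustration. The abstract states only that the reward is a truncation-based length penalty and that the training data is self-generated and filtered for correctness and conciseness.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """One self-generated response (hypothetical container, not from the paper)."""
    prompt: str
    response: str
    num_tokens: int


def truncated_length_reward(correct: bool, num_tokens: int, max_tokens: int) -> float:
    # Assumed binary form: reward 1.0 only when the answer is correct
    # AND the response fits within the length budget; 0.0 otherwise.
    return 1.0 if (correct and num_tokens <= max_tokens) else 0.0


def filter_for_sft(
    samples: List[Sample],
    is_correct: Callable[[Sample], bool],
    max_tokens: int,
) -> List[Sample]:
    # Keep only the samples that pass both filters; in this sketch the
    # kept samples would then be used as ordinary SFT targets.
    return [
        s for s in samples
        if truncated_length_reward(is_correct(s), s.num_tokens, max_tokens) == 1.0
    ]
```

Under these assumptions, each training round would sample responses from the current model, call `filter_for_sft`, and fine-tune on the surviving samples with a standard next-token loss, with no KL term and no group-wise normalization involved.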