🤖 AI Summary
To address the challenge of effectively integrating supervised fine-tuning (SFT) and reinforcement learning (RL) in large language model (LLM) reasoning tasks, this paper proposes SRFT—a single-stage joint optimization framework. Its core contribution lies in the first entropy-based characterization of the granularity disparity between SFT (coarse-grained policy updates) and RL (fine-grained exploration and optimization), enabling an entropy-aware weighting mechanism that unifies demonstration data and self-generated samples for end-to-end co-optimization. SRFT eliminates the suboptimality and training instability inherent in conventional two-stage paradigms. Evaluated on five mathematical reasoning benchmarks, SRFT achieves a mean accuracy of 59.1%, outperforming pure SFT by 9.0%; on three out-of-distribution benchmarks, it improves performance by 10.9%, demonstrating enhanced generalization.
📝 Abstract
Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet the optimal integration of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through comprehensive analysis of token distributions, learning dynamics, and integration mechanisms from entropy-based perspectives, we reveal key differences between these paradigms: SFT induces coarse-grained global changes to LLM policy distributions, while RL performs fine-grained selective optimizations, with entropy serving as a critical indicator of training effectiveness. Building on these observations, we propose Supervised Reinforcement Fine-Tuning (SRFT), a single-stage method that unifies both fine-tuning paradigms through entropy-aware weighting mechanisms. Our approach simultaneously applies SFT and RL to directly optimize the LLM using demonstrations and self-exploration rollouts rather than through two-stage sequential methods. Extensive experiments show that SRFT achieves 59.1% average accuracy, outperforming zero-RL methods by 9.0% on five mathematical reasoning benchmarks and 10.9% on three out-of-distribution benchmarks.