SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning

📅 2025-06-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of effectively integrating supervised fine-tuning (SFT) and reinforcement learning (RL) for large language model (LLM) reasoning, this paper proposes SRFT, a single-stage joint optimization framework. Its core contribution is an entropy-based characterization of the granularity disparity between SFT (coarse-grained global policy updates) and RL (fine-grained selective optimization), which motivates an entropy-aware weighting mechanism that unifies demonstration data and self-generated rollouts for end-to-end co-optimization. SRFT avoids the suboptimality and training instability of conventional two-stage SFT-then-RL pipelines. Evaluated on five mathematical reasoning benchmarks, SRFT achieves 59.1% average accuracy, outperforming zero-RL methods by 9.0%; on three out-of-distribution benchmarks it improves performance by 10.9%, indicating stronger generalization.

📝 Abstract
Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet the optimal integration of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through comprehensive analysis of token distributions, learning dynamics, and integration mechanisms from entropy-based perspectives, we reveal key differences between these paradigms: SFT induces coarse-grained global changes to LLM policy distributions, while RL performs fine-grained selective optimizations, with entropy serving as a critical indicator of training effectiveness. Building on these observations, we propose Supervised Reinforcement Fine-Tuning (SRFT), a single-stage method that unifies both fine-tuning paradigms through entropy-aware weighting mechanisms. Our approach simultaneously applies SFT and RL to directly optimize the LLM using demonstrations and self-exploration rollouts rather than through two-stage sequential methods. Extensive experiments show that SRFT achieves 59.1% average accuracy, outperforming zero-RL methods by 9.0% on five mathematical reasoning benchmarks and 10.9% on three out-of-distribution benchmarks.
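The entropy-aware weighting the abstract describes can be illustrated with a minimal sketch. The exact formulation is in the paper; the code below only shows the general idea under one plausible assumption: when the policy's token entropy is high (the model is uncertain), the combined loss leans on the SFT demonstration term, and when entropy is low, it leans on the RL self-exploration term. The function names and the linear weighting scheme are hypothetical, not taken from SRFT.

```python
import math

def entropy(probs):
    # Shannon entropy of a token probability distribution
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_aware_weights(probs, vocab_size):
    # Hypothetical scheme: normalize entropy by its maximum
    # (uniform distribution over the vocabulary) to get a weight in [0, 1].
    # High entropy -> rely on SFT demonstrations; low entropy -> rely on RL.
    h_max = math.log(vocab_size)
    w_sft = entropy(probs) / h_max
    return w_sft, 1.0 - w_sft

def combined_loss(sft_loss, rl_loss, probs, vocab_size):
    # Single-stage objective: one weighted sum instead of two sequential stages.
    w_sft, w_rl = entropy_aware_weights(probs, vocab_size)
    return w_sft * sft_loss + w_rl * rl_loss
```

For a near-uniform policy the SFT weight approaches 1, and for a sharply peaked policy it drops toward 0, so the balance between demonstration data and self-generated rollouts shifts continuously during training rather than at a fixed stage boundary.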
Problem

Research questions and friction points this paper is trying to address.

Optimizing integration of SFT and RL for reasoning tasks
Unifying fine-tuning paradigms via entropy-aware weighting mechanisms
Improving LLM accuracy on mathematical reasoning benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies SFT and RL via entropy-aware weighting
Single-stage method with direct optimization
Uses demonstrations and self-exploration rollouts