Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

📅 2025-07-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language model post-training faces two key limitations: supervised fine-tuning (SFT), as a form of behavior cloning, often generalizes poorly beyond its demonstrations, while reinforcement fine-tuning (RFT) relies heavily on a strong initial policy and is prone to learning biased behaviors. To address these issues, we propose Prefix-RFT—a novel framework that unifies SFT and RFT within a prefix-sampling paradigm. Specifically, it integrates demonstration-guided sampling into the standard RFT pipeline, enabling synergistic optimization between imitation learning and policy exploration. Crucially, Prefix-RFT requires no modification to underlying training paradigms and remains fully compatible with mainstream open-source RFT workflows. Experiments demonstrate substantial improvements on complex reasoning tasks—particularly mathematical reasoning—with enhanced performance and generalization across multiple benchmarks, outperforming pure SFT, pure RFT, and parallel hybrid strategies. Moreover, Prefix-RFT exhibits strong robustness to variations in both the quality and quantity of demonstration data.
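The core mechanic described above—conditioning rollouts on a partial expert demonstration—can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the function and parameter names (`sample_prefix_rollout`, `policy_generate`, `max_frac`) are hypothetical, and details such as the prefix-length distribution are assumptions.

```python
import random

def sample_prefix_rollout(policy_generate, prompt, demonstration, max_frac=1.0):
    """Illustrative sketch of demonstration-guided prefix sampling:
    condition the policy on a random-length prefix of an expert
    demonstration, then let it explore the continuation on its own.
    All names here are hypothetical, not from the paper."""
    # Pick a prefix cut point uniformly over the demonstration tokens
    # (the actual length schedule used in the paper may differ).
    cut = random.randint(0, int(len(demonstration) * max_frac))
    prefix = demonstration[:cut]
    # The policy completes the solution from prompt + demonstration prefix;
    # the full sequence would then be scored by the usual RFT reward.
    continuation = policy_generate(prompt + prefix)
    return prefix + continuation
```

Because the rollout still flows through the ordinary generate-then-reward loop, such a scheme slots into a standard RFT pipeline without changing the training objective—consistent with the compatibility claim above.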

📝 Abstract
Existing post-training techniques for large language models are broadly categorized into Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT). Each paradigm presents a distinct trade-off: SFT excels at mimicking demonstration data but can lead to problematic generalization as a form of behavior cloning. Conversely, RFT can significantly enhance a model's performance but is prone to learning unexpected behaviors, and its performance is highly sensitive to the initial policy. In this paper, we propose a unified view of these methods and introduce Prefix-RFT, a hybrid approach that synergizes learning from both demonstration and exploration. Using mathematical reasoning problems as a testbed, we empirically demonstrate that Prefix-RFT is both simple and effective. It not only surpasses the performance of standalone SFT and RFT but also outperforms parallel mixed-policy RFT methods. A key advantage is its seamless integration into existing open-source frameworks, requiring only minimal modifications to the standard RFT pipeline. Our analysis highlights the complementary nature of SFT and RFT, and validates that Prefix-RFT effectively harmonizes these two learning paradigms. Furthermore, ablation studies confirm the method's robustness to variations in the quality and quantity of demonstration data. We hope this work offers a new perspective on LLM post-training, suggesting that a unified paradigm that judiciously integrates demonstration and exploration could be a promising direction for future research.
Problem

Research questions and friction points this paper is trying to address.

Balancing SFT and RFT trade-offs in LLM fine-tuning
Improving model performance via hybrid demonstration-exploration learning
Ensuring robustness to demonstration data quality variations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid approach combining SFT and RFT
Prefix-RFT integrates demonstration and exploration
Minimal modification to standard RFT pipeline