Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning

📅 2025-09-08

🤖 AI Summary
Problem: In conventional two-stage training, supervised fine-tuning (SFT) and reinforcement learning (RL) are decoupled, which prevents the two from being co-optimized to improve the reasoning capabilities of large language models (LLMs). Method: We propose a bilevel optimization framework for SFT-RL co-training. The upper level optimizes a meta-objective, the cooperative gain, so that SFT learns how to guide RL policy updates; the lower level performs RL updates while also receiving differentiable SFT supervision, so both signals shape the policy during training. This removes the isolation between stages and lets SFT feed back into RL throughout training. Contribution/Results: On five reasoning benchmarks, the method consistently outperforms the standard SFT+RL baseline and achieves a better trade-off between reasoning performance and training efficiency.

📝 Abstract
Reinforcement learning (RL) has proven effective in incentivizing the reasoning abilities of large language models (LLMs), but suffers from severe efficiency challenges due to its trial-and-error nature. While the common practice employs supervised fine-tuning (SFT) as a warm-up stage for RL, this decoupled two-stage approach limits interaction between SFT and RL, thereby constraining overall effectiveness. This study introduces a novel method for learning reasoning models that employs bilevel optimization to facilitate better cooperation between these training paradigms. By conditioning the SFT objective on the optimal RL policy, our approach enables SFT to meta-learn how to guide RL's optimization process. During training, the lower level performs RL updates while simultaneously receiving SFT supervision, and the upper level explicitly maximizes the cooperative gain, that is, the performance advantage of joint SFT-RL training over RL alone. Empirical evaluations on five reasoning benchmarks demonstrate that our method consistently outperforms baselines and achieves a better balance between effectiveness and efficiency.
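
As a reading aid, the bilevel structure described in the abstract can be written schematically as below. All notation here is assumed for illustration and is not taken from the paper: $\theta$ denotes the policy parameters, $w$ the quantities that control how SFT supervision enters training, $J_{\mathrm{RL}}$ and $J_{\mathrm{SFT}}$ the two objectives, and $\theta_T(\cdot)$ the policy obtained after a budget of $T$ lower-level updates. The paper's exact objectives and parameterization may differ.

```latex
\begin{aligned}
\text{lower level:}\quad
  & \theta_T(w) \;\approx\; \arg\max_{\theta}\;
    \bigl[\, J_{\mathrm{RL}}(\theta) + J_{\mathrm{SFT}}(\theta;\, w) \,\bigr]
    && \text{(RL updates with SFT supervision)} \\
\text{RL-only reference:}\quad
  & \theta_T^{\mathrm{RL}} \;\approx\; \arg\max_{\theta}\; J_{\mathrm{RL}}(\theta)
    && \text{(same budget, no SFT)} \\
\text{upper level:}\quad
  & \max_{w}\; G(w), \qquad
    G(w) \;=\; J_{\mathrm{RL}}\bigl(\theta_T(w)\bigr) - J_{\mathrm{RL}}\bigl(\theta_T^{\mathrm{RL}}\bigr)
    && \text{(cooperative gain)}
\end{aligned}
```

Under this reading, $G(w) > 0$ means that joint SFT-RL training reaches a better policy than RL alone within the same training budget, which is the quantity the upper level is described as maximizing.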

Problem

Research questions and friction points this paper is trying to address.

Improving efficiency of RL training for LLM reasoning
Enhancing cooperation between SFT and RL stages
Balancing effectiveness and efficiency in reasoning models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bilevel optimization for SFT-RL cooperation
SFT meta-learns to guide RL optimization
Explicitly maximizes joint training performance gain
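
The three ideas above can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration and none of it comes from the paper: the policy is a single scalar `theta`, the RL and SFT objectives are toy quadratics, the upper-level variable is a scalar SFT weight `w`, and the upper-level gradient is estimated by finite differences rather than by whatever estimator the authors use. The lower level runs a fixed budget of gradient steps on RL plus weighted SFT, and the upper level adjusts `w` to maximize the gain over an RL-only run with the same budget.

```python
# Toy bilevel sketch (illustrative assumptions only, not the paper's method).

def j_rl(theta):
    """Hypothetical RL objective: reward peaks at theta = 3."""
    return -(theta - 3.0) ** 2


def grad_j_rl(theta):
    return -2.0 * (theta - 3.0)


def grad_j_sft(theta):
    """Gradient of a hypothetical SFT objective -(theta - 2)^2 (demonstrations near 2)."""
    return -2.0 * (theta - 2.0)


def lower_level(w, steps=5, lr=0.05, theta0=0.0):
    """Lower level: a fixed budget of ascent steps on J_RL + w * J_SFT."""
    theta = theta0
    for _ in range(steps):
        theta += lr * (grad_j_rl(theta) + w * grad_j_sft(theta))
    return theta


def cooperative_gain(w):
    """RL performance of joint training minus RL-only training (w = 0), same budget."""
    return j_rl(lower_level(w)) - j_rl(lower_level(0.0))


def upper_level(w0=0.0, steps=100, lr=0.5, eps=1e-4):
    """Upper level: ascend the cooperative gain in w using central finite differences."""
    w = w0
    for _ in range(steps):
        hypergrad = (cooperative_gain(w + eps) - cooperative_gain(w - eps)) / (2 * eps)
        w = max(0.0, w + lr * hypergrad)  # keep the SFT weight non-negative
    return w


if __name__ == "__main__":
    w_star = upper_level()
    print("learned SFT weight:", w_star)
    print("cooperative gain:", cooperative_gain(w_star))
```

In this toy, the SFT target is not the RL optimum, yet a positive SFT weight still helps because it speeds up progress within the limited step budget; that is one simple way a positive cooperative gain can arise, and it is not a claim about why the paper's method works.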