🤖 AI Summary
Problem: In conventional two-stage training, supervised fine-tuning (SFT) and reinforcement learning (RL) are decoupled, which hinders their co-optimization when training the reasoning capabilities of large language models (LLMs).
Method: We propose a bilevel optimization framework for SFT-RL co-training. The upper level optimizes a meta-objective, the cooperative gain (the performance advantage of joint SFT-RL training over RL alone), so that SFT learns how to guide RL policy updates. The lower level performs RL updates while explicitly incorporating differentiable SFT supervision, so that joint training is driven by policy gradients. This removes the isolation between stages and lets feedback from SFT reach RL during training rather than only through a warm-up checkpoint.
Contribution/Results: On five mainstream reasoning benchmarks, our method consistently outperforms the standard SFT+RL baseline and achieves a better trade-off between reasoning performance and training efficiency. These results support the effectiveness of the co-optimization paradigm for LLM reasoning.
📝 Abstract
Reinforcement learning (RL) has proven effective in incentivizing the reasoning abilities of large language models (LLMs), but suffers from severe efficiency challenges due to its trial-and-error nature. While the common practice employs supervised fine-tuning (SFT) as a warm-up stage for RL, this decoupled two-stage approach limits interaction between SFT and RL, thereby constraining overall effectiveness. This study introduces a novel method for learning reasoning models that employs bilevel optimization to facilitate better cooperation between these training paradigms. By conditioning the SFT objective on the optimal RL policy, our approach enables SFT to meta-learn how to guide RL's optimization process. During training, the lower level performs RL updates while simultaneously receiving SFT supervision, and the upper level explicitly maximizes the cooperative gain, i.e., the performance advantage of joint SFT-RL training over RL alone. Empirical evaluations on five reasoning benchmarks demonstrate that our method consistently outperforms baselines and achieves a better balance between effectiveness and efficiency.
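The two levels can be made concrete with a small formal sketch. All notation here is assumed for illustration and does not come from the abstract: θ denotes the policy parameters, ω the parameters that shape the SFT supervision, J_RL the expected reward, L_SFT the supervised loss, and α, λ > 0 a step size and a mixing weight. The sketch uses a single gradient step per level as a simplification; the paper's actual formulation may differ.

```latex
% Illustrative notation only; none of these symbols appear in the abstract.
% \theta: policy parameters; \omega: parameters shaping the SFT supervision
% J_RL: expected reward; L_SFT: supervised loss
% \alpha, \lambda > 0: assumed step size and mixing weight
\begin{align}
\theta_{\mathrm{RL}}
  &= \theta + \alpha \nabla_\theta J_{\mathrm{RL}}(\theta)
  && \text{(RL-only update)} \\
\theta_{\mathrm{joint}}(\omega)
  &= \theta + \alpha \nabla_\theta \bigl[ J_{\mathrm{RL}}(\theta)
     - \lambda L_{\mathrm{SFT}}(\theta;\omega) \bigr]
  && \text{(lower level: RL update with SFT supervision)} \\
G(\omega)
  &= J_{\mathrm{RL}}\bigl(\theta_{\mathrm{joint}}(\omega)\bigr)
     - J_{\mathrm{RL}}(\theta_{\mathrm{RL}})
  && \text{(cooperative gain)} \\
\omega^{\star}
  &\in \arg\max_{\omega} G(\omega)
  && \text{(upper level)}
\end{align}
```

Under this reading, θ_joint depends on ω, so (assuming the objectives are differentiable) the upper level can adjust ω according to how the supervision changes the outcome of the RL update. The gain is defined over budget-limited updates on purpose: if θ_RL were an exact maximizer of J_RL, G(ω) could never be positive, so the comparison is only meaningful when training efficiency is the constraint.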