SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning

📅 2026-04-26

🤖 AI Summary

Recent mixed-policy optimization methods have been reported to outperform the standard pipeline of supervised fine-tuning (SFT) followed by reinforcement learning (RL), but the SFT-then-RL baselines in those comparisons suffer from two implementation bugs. The first is an optimizer CPU-offloading bug in DeepSpeed that silently drops intermediate micro-batches during gradient accumulation and propagates to downstream frameworks including TRL, OpenRLHF, and Llama-Factory. The second is a loss aggregation bug in OpenRLHF that incorrectly weights per-mini-batch losses. Together they suppress SFT performance, with the optimizer bug accounting for most of the gap. With both bugs fixed, the SFT-then-RL pipeline surpasses every published mixed-policy method evaluated on math benchmarks, by 3.8 points with Qwen2.5-Math-7B and by 22.2 points with Llama-3.1-8B. A truncated variant with only 50 RL steps still outperforms the mixed-policy methods while using fewer FLOPs.

📝 Abstract

Recent mixed-policy optimization methods for LLM reasoning that interleave or blend supervised and reinforcement learning signals report improvements over the standard SFT-then-RL pipeline. We show that numerous recently published research papers rely on a faulty baseline caused by two distinct bugs: a CPU-offloaded optimizer bug in DeepSpeed that silently drops intermediate micro-batches during gradient accumulation (affecting multiple downstream frameworks including TRL, OpenRLHF and Llama-Factory), and a loss aggregation bug in OpenRLHF that incorrectly weights per-mini-batch losses. Together they suppress SFT performance, with the optimizer bug accounting for most of the gap and the loss aggregation bug contributing a smaller additional effect. Once corrected, the standard SFT-then-RL pipeline surpasses every published mixed-policy method we evaluate by +3.8 points on math benchmarks with Qwen2.5-Math-7B and by +22.2 points with Llama-3.1-8B. Even a truncated variant with just 50 RL steps outperforms mixed-policy methods on math benchmarks while using fewer FLOPs.
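To make the two failure modes concrete, here is a toy sketch in plain Python. It is not DeepSpeed or OpenRLHF code, and all function names are hypothetical. It assumes that (a) dropping intermediate micro-batches leaves only the final micro-batch's gradient in the accumulated update, and (b) a loss weighting mismatch can arise when per-micro-batch mean losses are averaged without accounting for differing token counts. Whether the real bugs take exactly these forms is not stated in the abstract.

```python
# Toy illustration only. These helpers are hypothetical and do not reproduce
# DeepSpeed or OpenRLHF internals.

def accumulate(micro_batch_grads, drop_intermediate=False):
    """Average scalar gradients over one gradient-accumulation window.

    With drop_intermediate=True, only the last micro-batch contributes,
    which is an assumed model of silently dropped micro-batches.
    """
    n = len(micro_batch_grads)
    used = micro_batch_grads[-1:] if drop_intermediate else micro_batch_grads
    return sum(used) / n


def token_weighted_loss(micro_batches):
    """Mean loss over all tokens, so every token has equal weight."""
    total = sum(sum(losses) for losses in micro_batches)
    count = sum(len(losses) for losses in micro_batches)
    return total / count


def mean_of_means_loss(micro_batches):
    """Mean of per-micro-batch means, so short micro-batches are over-weighted."""
    per_batch = [sum(losses) / len(losses) for losses in micro_batches]
    return sum(per_batch) / len(per_batch)


if __name__ == "__main__":
    grads = [1.0, 2.0, 3.0]
    print("full accumulation:   ", accumulate(grads))
    print("dropped intermediate:", accumulate(grads, drop_intermediate=True))

    # Per-token losses for two micro-batches of unequal length.
    batches = [[1.0, 1.0, 1.0, 1.0], [3.0]]
    print("token-weighted loss: ", token_weighted_loss(batches))
    print("mean-of-means loss:  ", mean_of_means_loss(batches))
```

In this toy setup the two aggregation schemes agree only when every micro-batch holds the same number of tokens, and the dropped-micro-batch update ignores most of the data in each accumulation window.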

Problem

Research questions and friction points this paper is trying to address.

SFT-then-RL

mixed-policy optimization

LLM reasoning

baseline bug

performance evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

SFT-then-RL

mixed-policy optimization

optimizer bug