🤖 AI Summary
Recent mixed-policy optimization methods have been reported to outperform the standard pipeline of supervised fine-tuning (SFT) followed by reinforcement learning (RL), but the SFT-then-RL baselines they compare against suffer from two implementation bugs. This work identifies a CPU-offloaded optimizer bug in DeepSpeed, which silently drops intermediate micro-batches during gradient accumulation and propagates to downstream frameworks such as TRL, OpenRLHF, and Llama-Factory, and a loss aggregation bug in OpenRLHF that incorrectly weights per-mini-batch losses. Together they suppress SFT performance, with the optimizer bug accounting for most of the gap. After correcting both bugs, the work re-evaluates the two paradigms on mathematical reasoning with large language models. The corrected SFT-then-RL pipeline outperforms every mixed-policy method evaluated, by 3.8 points on Qwen2.5-Math-7B and by 22.2 points on Llama-3.1-8B, and a truncated variant with only 50 RL steps still outperforms the mixed-policy methods while using fewer FLOPs.
📝 Abstract
Recent mixed-policy optimization methods for LLM reasoning that interleave or blend supervised and reinforcement learning signals report improvements over the standard SFT-then-RL pipeline. We show that numerous recently published research papers rely on a faulty baseline caused by two distinct bugs: a CPU-offloaded optimizer bug in DeepSpeed that silently drops intermediate micro-batches during gradient accumulation (affecting multiple downstream frameworks including TRL, OpenRLHF and Llama-Factory), and a loss aggregation bug in OpenRLHF that incorrectly weights per-mini-batch losses. Together they suppress SFT performance, with the optimizer bug accounting for most of the gap and the loss aggregation bug contributing a smaller additional effect. Once corrected, the standard SFT-then-RL pipeline surpasses every published mixed-policy method we evaluate by +3.8 points on math benchmarks with Qwen2.5-Math-7B and by +22.2 points with Llama-3.1-8B. Even a truncated variant with just 50 RL steps outperforms mixed-policy methods on math benchmarks while using fewer FLOPs.
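To make the two failure modes concrete, here is a minimal toy sketch in plain Python. It is not DeepSpeed or OpenRLHF code, and all function names are hypothetical. It assumes scalar "gradients" and per-token losses, and it assumes the dropped micro-batches simply contribute nothing to the accumulated update. The abstract does not say which loss weighting is the intended one, so the second half only shows that the two aggregations disagree when mini-batches hold different numbers of tokens.

```python
# Toy illustration only; names and scaling are assumptions, not framework code.

def accumulate(micro_grads, drop_intermediate=False):
    """Average scalar gradients from several micro-batches into one update.

    With drop_intermediate=True only the last micro-batch contributes,
    mimicking an accumulator that silently loses the earlier ones.
    """
    n = len(micro_grads)
    kept = micro_grads[-1:] if drop_intermediate else micro_grads
    return sum(g / n for g in kept)


def token_weighted_loss(batches):
    """Mean loss over all tokens, so each token counts equally."""
    total = sum(sum(b) for b in batches)
    count = sum(len(b) for b in batches)
    return total / count


def mean_of_means_loss(batches):
    """Unweighted mean of per-mini-batch means, so each mini-batch counts equally."""
    return sum(sum(b) / len(b) for b in batches) / len(batches)


if __name__ == "__main__":
    grads = [1.0, 2.0, 3.0, 6.0]
    print(accumulate(grads))                          # 3.0
    print(accumulate(grads, drop_intermediate=True))  # 1.5

    # Two mini-batches with 4 tokens and 1 token respectively.
    batches = [[1.0, 1.0, 1.0, 1.0], [3.0]]
    print(token_weighted_loss(batches))  # 1.4
    print(mean_of_means_loss(batches))   # 2.0
```

In this toy setting the dropped accumulation yields an update built from a quarter of the data, and the two loss aggregations differ because the single-token mini-batch is given the same weight as the four-token one.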