Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work challenges the assumed positive correlation between supervised fine-tuning (SFT) performance and subsequent reinforcement learning (RL) gains, revealing that high SFT scores often stem from data simplicity or homogeneity, which misleads RL-gain prediction and can even yield post-RL performance worse than RL on the base model without SFT. To address this, the authors propose two more reliable proxies for RL effectiveness: generalization loss on held-out reasoning samples and Pass@large-k. Using the GRPO algorithm and RL with verifiable rewards, they conduct large-scale experiments across seven mathematical reasoning benchmarks, multiple models (up to 12B parameters), and diverse datasets. The proposed metrics improve the $R^2$ coefficient and Spearman's rank correlation with actual RL gains by up to 0.5, roughly doubling predictive accuracy. An evaluation toolkit is to be open-sourced, providing both empirical grounding and practical guidance for efficient post-training strategies.

📝 Abstract
In post-training for reasoning Large Language Models (LLMs), the current state of practice trains LLMs in two independent stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR, shortened as ``RL'' below). In this work, we challenge whether high SFT scores translate to improved performance after RL, and we provide extensive counter-examples where they do not. We find that high SFT scores can be biased toward simpler or more homogeneous data and are not reliably predictive of subsequent RL gains or scaled-up post-training effectiveness. In some cases, RL training on models with improved SFT performance leads to substantially worse outcomes than RL on the base model without SFT. We study alternative metrics and identify generalization loss on held-out reasoning examples and Pass@large-k performance as strong proxies for the RL outcome. We trained hundreds of models of up to 12B parameters with SFT and RLVR via GRPO and ran extensive evaluations on 7 math benchmarks with up to 256 repetitions, spending $>$1M GPU hours. Experiments include models from Llama3, Mistral-Nemo, and Qwen3, and multiple state-of-the-art SFT/RL datasets. Compared to predicting directly from pre-RL performance, prediction based on generalization loss and Pass@large-k achieves substantially higher precision, improving the $R^2$ coefficient and Spearman's rank correlation coefficient by up to 0.5 (2x). This provides strong utility for broad use cases. For example, in most experiments we find that SFT on unique examples for one epoch underperforms SFT on half as many examples for two epochs, both after SFT and after SFT-then-RL. With the same SFT budget, training only on short examples may yield better SFT performance, yet it often leads to worse outcomes after RL than training on examples of varying lengths. The evaluation tool will be open-sourced.
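The Pass@large-k proxy mentioned in the abstract is typically computed with the standard unbiased Pass@k estimator (the combinatorial formula popularized by the Codex evaluation); whether the paper uses exactly this estimator is an assumption, but it is the common way to evaluate k of n sampled completions without bias. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: the probability that at least one of k
    completions drawn without replacement from n samples is correct,
    given that c of the n samples passed the verifier.

    Equals 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With the abstract's up-to-256 repetitions, `pass_at_k(n=256, c=..., k=64)` would estimate Pass@64 from 256 sampled solutions per problem (the specific k is illustrative, not from the paper).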
Problem

Research questions and friction points this paper is trying to address.

High SFT scores mislead about RL outcomes
Generalization loss predicts RL effectiveness better
Alternative metrics improve post-training performance prediction
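The paper scores predictive quality with the $R^2$ coefficient and Spearman's rank correlation between a pre-RL proxy (held-out generalization loss or Pass@large-k) and the realized RL gain. A minimal sketch of those two scores, assuming one proxy value and one measured RL gain per trained model (the helper names are hypothetical):

```python
import numpy as np

def r_squared(proxy: np.ndarray, rl_gain: np.ndarray) -> float:
    """Squared Pearson correlation of proxy vs. actual RL gain,
    i.e. the R^2 of a simple linear fit."""
    r = np.corrcoef(proxy, rl_gain)[0, 1]
    return float(r ** 2)

def spearman_rho(proxy: np.ndarray, rl_gain: np.ndarray) -> float:
    """Spearman's rank correlation: Pearson correlation applied to
    rank-transformed values (ties broken arbitrarily; fine for a sketch)."""
    rank_p = np.argsort(np.argsort(proxy))
    rank_g = np.argsort(np.argsort(rl_gain))
    return float(np.corrcoef(rank_p, rank_g)[0, 1])
```

A proxy that orders models correctly but nonlinearly would score Spearman's rho of 1.0 while R^2 stays below 1.0, which is why the paper reports both.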
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies generalization loss as RL predictor
Uses Pass@large k for post-training evaluation
Trains models with GRPO reinforcement learning
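GRPO's defining change from PPO is replacing the learned value baseline with a group-relative advantage: each prompt is sampled G times, and every completion's verifiable reward is normalized against its own group's statistics. A minimal sketch of that normalization, assuming the standard mean/std form with a small epsilon for stability (the epsilon value is an assumption):

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages for one prompt's G sampled completions:
    standardize each completion's reward against the group mean and std,
    so no separate value network is needed."""
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)
```

With binary verifiable rewards (1 = answer verified correct, 0 = incorrect), correct completions in a mostly-wrong group receive large positive advantages, which is what drives the policy update.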