Debunk the Myth of SFT Generalization

📅 2025-09-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work challenges the common misconception that supervised fine-tuning (SFT) inherently exhibits weaker generalization than reinforcement learning (RL), attributing SFT's limited generalization primarily to memorization effects induced by fixed prompt templates rather than to intrinsic limitations of SFT itself. To address this, the authors propose a synergistic mechanism combining prompt diversity sampling and chain-of-thought (CoT) supervision, which broadens instruction coverage and makes reasoning explicit in the training data. Systematic evaluation on the Sokoban and General Points decision-making benchmarks demonstrates substantial improvements in generalization to unseen instructions, instruction variants, and higher-difficulty tasks; the method matches or surpasses RL baselines across most metrics while preserving SFT's training stability and implementation simplicity. The core contribution is identifying and resolving a critical data-design bottleneck in SFT generalization, pointing toward a data-centric recipe for efficient and robust instruction tuning.

📝 Abstract
A prevailing view holds that supervised fine-tuning (SFT) memorizes training data and fails to generalize, whereas reinforcement learning (RL) attains broader robustness. We revisit this claim through a systematic evaluation on two decision-making benchmarks, Sokoban and General Points, and arrive at a different conclusion. We show that much of SFT's perceived failure stems from frozen-prompt artifacts: when trained on fixed instruction templates, SFT models cling to training semantics rather than adapting to new ones. Introducing prompt diversity during training breaks this shortcut and yields strong generalization to unseen instruction variants without harming in-distribution performance. Beyond instruction shifts, we ask whether SFT can generalize to strictly harder tasks. Here, chain-of-thought (CoT) supervision provides an algorithmic scaffold that markedly improves transfer to more difficult regimes, such as larger Sokoban grids with additional boxes and arithmetic with out-of-distribution values or five-card compositions that increase combinatorial complexity. Finally, combining prompt diversity with CoT achieves the best of both worlds: robust generalization across both instruction-variant and difficulty-variant settings, matching or surpassing RL baselines on our benchmarks while retaining SFT's simplicity and stability. These findings challenge the narrative that SFT is inherently inferior to RL and support a data-centric perspective: with appropriately curated demonstrations, vanilla SFT can generalize as strongly as RL. Code reproducing the results in the paper can be found at: https://github.com/XiaofengLin7/debunking-sft-generalization.
Problem

Research questions and friction points this paper is trying to address.

Challenges the view that SFT memorizes data and lacks generalization
Shows SFT generalizes with prompt diversity and chain-of-thought supervision
Demonstrates SFT matches RL performance with curated training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing prompt diversity during supervised fine-tuning
Using chain-of-thought supervision for harder tasks
Combining prompt diversity with chain-of-thought supervision
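The two ingredients above operate purely at the data-construction stage. As an illustrative sketch (the templates, board encoding, and field names here are assumptions, not the paper's actual data format), each SFT example can pair a randomly sampled instruction paraphrase with a target that spells out the reasoning steps before the final answer:

```python
import random

# Hypothetical instruction paraphrases for the same Sokoban task.
# Sampling a template per example is the "prompt diversity" ingredient.
PROMPT_TEMPLATES = [
    "Move every box onto a target square: {state}",
    "Solve this Sokoban puzzle by pushing boxes onto goals: {state}",
    "You control the warehouse keeper. Push all boxes onto the goal tiles.\n{state}",
]

def make_sft_example(state, cot_steps, actions):
    """Build one SFT example combining prompt diversity (random template)
    with CoT supervision (explicit reasoning trace before the answer)."""
    prompt = random.choice(PROMPT_TEMPLATES).format(state=state)
    reasoning = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(cot_steps))
    completion = f"{reasoning}\nAnswer: {' '.join(actions)}"
    return {"prompt": prompt, "completion": completion}

example = make_sft_example(
    state="#.$@#",  # toy board string, purely illustrative
    cot_steps=["the box sits one tile left of a goal", "push the box left"],
    actions=["left"],
)
```

Under this framing, the fixed-template baseline the paper critiques corresponds to always using a single entry of `PROMPT_TEMPLATES`, and ablating `cot_steps` recovers answer-only supervision.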