Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

📅 2026-04-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study challenges the prevailing view that supervised fine-tuning (SFT) merely memorizes without generalizing, systematically investigating the cross-domain generalization of long chain-of-thought (CoT) SFT on reasoning tasks. Through cross-domain evaluation, solution trajectory analysis, comparisons across base models of varying capability, and training dynamics tracking, the work shows that SFT generalization is conditional: stronger models internalize transferable procedural reasoning strategies (e.g., backtracking) even from simple tasks such as a toy arithmetic game, whereas weaker models only mimic superficial patterns such as verbosity. Cross-domain performance follows a non-monotonic “dip-and-recovery” training dynamic, so checkpoints taken early in training can underestimate generalization. The study further finds that verified long-CoT data yields consistent out-of-domain gains while low-quality solutions broadly hurt generalization, and that the gain in reasoning capability comes with degraded safety, highlighting a trade-off inherent in this generalization.

📝 Abstract

A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades, reframing the question from whether reasoning SFT generalizes to under what conditions and at what cost.
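The dip-and-recovery pattern described above can be made concrete with a small sketch. The function name `dip_and_recovery`, its parameters, and the score values are all hypothetical and chosen purely for illustration; they are not taken from the paper, and the paper's own evaluation procedure may differ. The sketch only shows why evaluating at a short-training checkpoint can understate the final out-of-domain score.

```python
from typing import Sequence


def dip_and_recovery(scores: Sequence[float], baseline: float, tol: float = 0.0) -> dict:
    """Summarize an out-of-domain score curve over successive SFT checkpoints.

    scores[i] is the score at checkpoint i; baseline is the base model's
    score before any fine-tuning. Both are illustrative inputs (assumptions).
    """
    if not scores:
        raise ValueError("scores must be non-empty")

    # Checkpoint with the lowest out-of-domain score (first one on ties).
    trough_idx = min(range(len(scores)), key=scores.__getitem__)
    trough = scores[trough_idx]
    final = scores[-1]

    # "Dip": the curve falls below the base model at some checkpoint.
    dipped = trough < baseline - tol
    # "Recovery": the last checkpoint ends above the base model, after the trough.
    recovered = dipped and final > baseline + tol and trough_idx < len(scores) - 1

    return {
        "trough_checkpoint": trough_idx,
        "dipped": dipped,
        "recovered": recovered,
        # How much an evaluation at the trough would understate the final score.
        "underestimate_at_trough": final - trough,
    }


# Made-up numbers for illustration only; not results from the paper.
curve = [0.38, 0.31, 0.35, 0.42, 0.47]
summary = dip_and_recovery(curve, baseline=0.40)
print(summary)
```

Under these assumed numbers, stopping at the second checkpoint would suggest that SFT hurts cross-domain performance, while the final checkpoint ends above the base model.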

Problem

Research questions and friction points this paper is trying to address.

generalization

reasoning SFT

chain-of-thought

cross-domain

model capability

Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning SFT

chain-of-thought

cross-domain generalization