Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

πŸ“… 2026-04-07
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study challenges the prevailing view that supervised fine-tuning (SFT) merely memorizes without generalizing, systematically investigating the cross-domain generalization of long chain-of-thought (CoT) SFT on reasoning tasks. Through cross-domain evaluation, solution-trajectory analysis, comparisons across base models of varying scale, and tracking of training dynamics, the work shows that SFT generalization is conditional: stronger models internalize transferable procedural reasoning strategies even from simple tasks, exhibiting a non-monotonic β€œdip-and-recovery” training dynamic, whereas weaker models merely mimic superficial patterns. High-quality long-CoT data further yields substantial out-of-domain gains, yet this improvement in reasoning may come at the cost of reduced safety, highlighting a trade-off inherent in this form of generalization.
πŸ“ Abstract
A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades, reframing the question from whether reasoning SFT generalizes to under what conditions and at what cost.
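
The abstract's central optimization claim, that short-training checkpoints understate generalization because cross-domain accuracy dips before recovering, can be made concrete with a small check over logged checkpoint scores. The sketch below is illustrative, not from the paper: the `dip_and_recover` helper and the accuracy trajectory are hypothetical placeholders.

```python
# Minimal sketch (assumed, not from the paper): flag the "dip-and-recovery"
# pattern in out-of-domain accuracy logged at successive SFT checkpoints.

def dip_and_recover(scores: list[tuple[int, float]], tol: float = 0.0) -> bool:
    """Return True if accuracy first drops below its starting value and
    later recovers to exceed it; in that case an early checkpoint would
    underestimate final cross-domain generalization."""
    _, accs = zip(*sorted(scores))                 # order by training step
    start, final = accs[0], accs[-1]
    low_idx = min(range(len(accs)), key=accs.__getitem__)
    dipped = accs[low_idx] < start - tol           # mid-training degradation...
    recovered = final > start + tol                # ...reversed by more training
    return dipped and recovered and low_idx < len(accs) - 1

# Hypothetical (step, out-of-domain accuracy) pairs; the numbers are made up.
trajectory = [(0, 0.42), (1000, 0.37), (2000, 0.33),
              (3000, 0.36), (4000, 0.44), (5000, 0.49)]
print(dip_and_recover(trajectory))  # True: stopping at step 2000 would mislead
```

On this made-up trajectory, an evaluation at step 2000 would report a cross-domain regression that steps 4000 and 5000 reverse, which is exactly the under-optimization artifact the abstract describes.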
Problem

Research questions and friction points this paper is trying to address.

generalization, reasoning SFT, chain-of-thought, cross-domain, model capability
Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning SFT, chain-of-thought, cross-domain generalization, optimization dynamics, model capability