AI Summary
This study challenges the prevailing view that supervised fine-tuning (SFT) merely memorizes without generalizing, systematically investigating the cross-domain generalization capabilities of long chain-of-thought (CoT) SFT in reasoning tasks. Through cross-domain evaluation, solution trajectory analysis, comparisons across base models of varying scales, and training dynamics tracking, the work reveals that SFT generalization is conditional: stronger models internalize transferable procedural reasoning strategies from simple tasks, exhibiting a non-monotonic "dip-and-recover" training dynamic, whereas weaker models only mimic superficial patterns. The study further demonstrates that high-quality long CoT data substantially enhances out-of-domain performance, yet this gain in reasoning capability may come at the cost of reduced safety, highlighting a potential trade-off inherent in generalization.
Abstract
A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades, reframing the question from whether reasoning SFT generalizes to under what conditions and at what cost.