🤖 AI Summary
Supervised fine-tuning (SFT) of large language models (LLMs) on distilled reasoning data often passes on the teacher model's "overthinking" behavior, resulting in unnecessarily lengthy and inefficient chain-of-thought (CoT) reasoning traces.
Method: We propose Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning (LS-Mixture SFT), a CoT data-mixing paradigm that pairs detailed long reasoning traces with concise, structure-preserved short rewrites. The approach combines CoT distillation, structure-preserved rewriting, and reasoning-capability alignment, preventing teacher-model deficiencies from being transmitted to the student.
Contribution/Results: Evaluated across multiple benchmarks, LS-Mixture SFT achieves an average accuracy improvement of 2.3% and reduces response length by 47.61%, substantially improving the trade-off between reasoning efficiency and accuracy without compromising logical completeness.
📝 Abstract
Recent advances in large language models have demonstrated that Supervised Fine-Tuning (SFT) with Chain-of-Thought (CoT) reasoning data distilled from large reasoning models (e.g., DeepSeek R1) can effectively transfer reasoning capabilities to non-reasoning models. However, models fine-tuned with this approach inherit the "overthinking" problem from teacher models, producing verbose and redundant reasoning chains during inference. To address this challenge, we propose **L**ong-**S**hort Chain-of-Thought **Mixture** **S**upervised **F**ine-**T**uning (**LS-Mixture SFT**), which combines long CoT reasoning datasets with their short counterparts obtained through structure-preserved rewriting. Our experiments demonstrate that models trained using the LS-Mixture SFT method, compared to those trained with direct SFT, achieved an average accuracy improvement of 2.3% across various benchmarks while substantially reducing model response length by approximately 47.61%. This work offers an approach to endowing non-reasoning models with reasoning capabilities through supervised fine-tuning while avoiding the overthinking problem inherited from teacher models, thereby enabling efficient reasoning in the fine-tuned models.
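The core recipe described above, mixing each long distilled CoT trace with a structure-preserved short rewrite before SFT, can be sketched as follows. This is only an illustrative reconstruction: the function names (`build_ls_mixture`, `shorten`), the per-sample mixing ratio, and the truncation heuristic are assumptions, not the paper's implementation; in particular, the actual structure-preserved rewriting is performed by an LLM, not by dropping steps.

```python
import random

def shorten(cot_steps):
    """Toy stand-in for structure-preserved rewriting: keep the first
    and last reasoning steps, drop the middle. (The paper uses an LLM
    rewriter; this heuristic is purely illustrative.)"""
    if len(cot_steps) <= 2:
        return list(cot_steps)
    return [cot_steps[0], cot_steps[-1]]

def build_ls_mixture(long_samples, rewrite_fn, mix_ratio=0.5, seed=0):
    """Build a mixed SFT dataset from long CoT samples.

    Each sample is kept as its long trace or replaced by a short
    rewrite of the same problem; `mix_ratio` is the expected fraction
    of short samples in the mixture."""
    rng = random.Random(seed)
    mixed = []
    for s in long_samples:
        use_short = rng.random() < mix_ratio
        mixed.append({
            "question": s["question"],
            "cot": rewrite_fn(s["cot"]) if use_short else s["cot"],
            "style": "short" if use_short else "long",
        })
    return mixed
```

A training run would then fine-tune the student on the `question`/`cot` pairs from the mixture, so the model sees both detailed and concise reasoning styles for the same distribution of problems.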