🤖 AI Summary
Supervised fine-tuning (SFT) of large language models (LLMs) on distilled reasoning data often passes on the teacher model's "overthinking" behavior, resulting in unnecessarily lengthy and inefficient chain-of-thought (CoT) reasoning traces.
Method: We propose Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning (LS-Mixture SFT), a CoT data-mixing paradigm that pairs detailed long reasoning traces with concise, structure-preserved short rewrites. The approach combines CoT distillation, structure-preserved rewriting, and reasoning-capability alignment, preventing teacher-model deficiencies from being transmitted to the student.
Contribution/Results: Evaluated across multiple benchmarks, LS-Mixture SFT achieves an average accuracy improvement of 2.3% and reduces response length by 47.61%, substantially improving the trade-off between reasoning efficiency and accuracy without compromising logical completeness.
📝 Abstract
Recent advances in large language models have demonstrated that Supervised Fine-Tuning (SFT) with Chain-of-Thought (CoT) reasoning data distilled from large reasoning models (e.g., DeepSeek R1) can effectively transfer reasoning capabilities to non-reasoning models. However, models fine-tuned with this approach inherit the "overthinking" problem from teacher models, producing verbose and redundant reasoning chains during inference. To address this challenge, we propose **L**ong-**S**hort Chain-of-Thought **Mixture** **S**upervised **F**ine-**T**uning (**LS-Mixture SFT**), which combines long CoT reasoning datasets with their short counterparts obtained through structure-preserved rewriting. Our experiments demonstrate that models trained using the LS-Mixture SFT method, compared to those trained with direct SFT, achieved an average accuracy improvement of 2.3% across various benchmarks while substantially reducing model response length by approximately 47.61%. This work offers an approach to endowing non-reasoning models with reasoning capabilities through supervised fine-tuning while avoiding the overthinking problem inherited from teacher models, thereby enabling efficient reasoning in the fine-tuned models.
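The core recipe described above, mixing each long distilled CoT trace with a structure-preserved short rewrite before SFT, can be sketched as follows. This is only an illustrative reconstruction: the function names (`build_ls_mixture`, `shorten`), the per-sample mixing ratio, and the truncation heuristic are assumptions, not the paper's implementation; in particular, the actual structure-preserved rewriting is performed by an LLM, not by dropping steps.

```python
import random

def shorten(cot_steps):
    """Toy stand-in for structure-preserved rewriting: keep the first
    and last reasoning steps, drop the middle. (The paper uses an LLM
    rewriter; this heuristic is purely illustrative.)"""
    if len(cot_steps) <= 2:
        return list(cot_steps)
    return [cot_steps[0], cot_steps[-1]]

def build_ls_mixture(long_samples, rewrite_fn, mix_ratio=0.5, seed=0):
    """Build a mixed SFT dataset from long CoT samples.

    Each sample is kept as its long trace or replaced by a short
    rewrite of the same problem; `mix_ratio` is the expected fraction
    of short samples in the mixture."""
    rng = random.Random(seed)
    mixed = []
    for s in long_samples:
        use_short = rng.random() < mix_ratio
        mixed.append({
            "question": s["question"],
            "cot": rewrite_fn(s["cot"]) if use_short else s["cot"],
            "style": "short" if use_short else "long",
        })
    return mixed
```

A training run would then fine-tune the student on the `question`/`cot` pairs from the mixture, so the model sees both detailed and concise reasoning styles for the same distribution of problems.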