In Their Own Words: Reasoning Traces Tailored for Small Models Make Them Better Reasoners

📅 2025-09-26
🤖 AI Summary
To address performance degradation in small language models (SLMs) caused by distribution mismatch when transferring reasoning capabilities from large language models (LLMs), this paper proposes Reverse Speculative Decoding (RSD). In RSD, the student model—not the teacher—orchestrates the inference trajectory: it dynamically accepts or rejects candidate tokens proposed by the LLM based on its own output distribution, thereby achieving distribution alignment. This paradigm shifts knowledge distillation from teacher-centric token imitation to student-centric, distribution-adaptive step selection, overcoming the long-standing challenge of learning low-probability tokens under standard distillation. Experiments on Qwen3-0.6B show that conventional distillation degrades reasoning performance by 20.5%, whereas RSD-enhanced supervised fine-tuning yields an average 4.9% improvement. The core contribution is the first formalization of a student-driven decoding framework, where the student model assumes primary control over token generation and rejection, enabling effective mitigation of distillation mismatch through reverse, distribution-aware supervision.

📝 Abstract
Transferring reasoning capabilities from larger language models to smaller ones through supervised fine-tuning often fails counterintuitively, with performance degrading despite access to high-quality teacher demonstrations. We identify that this failure stems from distributional misalignment: reasoning traces from larger models contain tokens that are low probability under the student's distribution, exceeding the internal representation capacity of smaller architectures and creating learning barriers rather than helpful guidance. We propose Reverse Speculative Decoding (RSD), a mechanism for generating student-friendly reasoning traces in which the teacher model proposes candidate tokens but the student model determines acceptance based on its own probability distributions, filtering low probability tokens. When applied to Qwen3-0.6B, direct distillation of s1K-1.1 reasoning trace data degrades average performance across major reasoning benchmarks by 20.5%, while the same model trained on RSD-generated reasoning traces achieves meaningful improvements of 4.9%. Our analysis reveals that low probability tokens constitute the critical bottleneck in reasoning ability transfer. However, cross-model experiments demonstrate that RSD traces are model-specific rather than universally applicable, indicating that distributional alignment must be tailored for each student architecture's unique internal representation.
Problem

Research questions and friction points this paper is trying to address.

Transferring reasoning from large to small models fails
Distributional misalignment creates learning barriers for students
Student-specific traces are needed for effective knowledge transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reverse Speculative Decoding generates student-friendly reasoning traces
Teacher proposes tokens while student filters by its probability distributions
Tailored distributional alignment for each student architecture
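The propose-then-filter loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact algorithm: the acceptance threshold `tau`, the greedy proposal/fallback rules, and the toy distribution callables are all assumptions made for the sake of a runnable example.

```python
def rsd_generate(teacher_dist, student_dist, prefix=(), max_steps=8,
                 tau=0.05, eos="<eos>"):
    """Reverse Speculative Decoding (illustrative sketch).

    teacher_dist / student_dist: callables mapping a token prefix (tuple)
    to a dict {token: probability}; in practice these would be forward
    passes of the teacher and student LMs. At each step the teacher
    proposes its top token, and the student accepts it only if the token
    has probability >= tau under the student's OWN distribution;
    otherwise the student substitutes its own top token. The threshold
    and greedy rules here are assumed, not taken from the paper.
    """
    tokens = list(prefix)
    for _ in range(max_steps):
        t_probs = teacher_dist(tuple(tokens))
        proposal = max(t_probs, key=t_probs.get)
        s_probs = student_dist(tuple(tokens))
        if s_probs.get(proposal, 0.0) >= tau:
            nxt = proposal                       # accept teacher's token
        else:
            nxt = max(s_probs, key=s_probs.get)  # reject: student picks its own
        tokens.append(nxt)
        if nxt == eos:
            break
    return tokens


# Toy stand-in distributions: the teacher favors a token that is
# low probability for the student, so RSD filters it out.
def teacher_dist(prefix):
    return {"rare": 0.9, "common": 0.1}

def student_dist(prefix):
    return {"rare": 0.01, "common": 0.8, "<eos>": 0.19}

trace = rsd_generate(teacher_dist, student_dist, max_steps=3)
```

The key design point is that control rests with the student: the teacher only supplies candidates, so every token in the final trace is one the student assigns non-negligible probability to, which is exactly the distributional alignment the paper argues standard distillation lacks.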