Revisiting the Capacity Gap in Chain-of-Thought Distillation from a Practical Perspective

📅 2026-04-10

🤖 AI Summary
This work addresses an often overlooked issue in chain-of-thought (CoT) distillation: student models frequently perform worse after distillation than they did before it. Evaluations that report only post-distillation performance obscure this degradation. The paper therefore proposes an evaluation protocol that also compares against the student's pre-distillation baseline, enabling a systematic analysis of how the capacity gap between teacher and student affects distillation outcomes across diverse tasks and teachers of varying proficiency. Through multitask experiments and controlled ablations, the study shows that the choice of teacher model plays a pivotal role in distillation efficacy, and that capacity gap effects do not consistently dominate, especially when candidate teachers differ substantially in performance. These findings offer practical guidance for selecting teacher-student pairs in CoT distillation.

📝 Abstract
Chain-of-thought (CoT) distillation transfers reasoning behaviors from a strong teacher to a smaller student, but prior work reports a capacity gap: distillation may fail when the teacher-student capability mismatch is large. We revisit the capacity gap from a practical perspective by re-examining commonly used experimental settings. Notably, we find that CoT distillation often degrades performance compared to the student's pre-distillation baseline, an issue obscured when only post-distillation comparisons are reported. We therefore propose a more realistic evaluation protocol and find that the impact of capacity gap effects does not consistently dominate across tasks and settings, especially when candidate teachers differ substantially in performance. Our results offer practical guidance for selecting teacher-student pairs in CoT distillation.
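
The baseline-aware comparison described above can be illustrated with a minimal sketch. It is not the paper's protocol or code: the record type, function names, task and teacher labels, and all accuracy values below are hypothetical placeholders chosen only to show how a post-distillation-only comparison can pick a teacher even when every candidate leaves the student worse than its pre-distillation baseline.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DistillationResult:
    """One (task, teacher) outcome for a fixed student. All fields are illustrative."""
    task: str
    teacher: str
    baseline_acc: float   # student accuracy before distillation
    distilled_acc: float  # student accuracy after distillation

    @property
    def gain(self) -> float:
        # Positive: distillation helped. Negative: it degraded the student.
        return self.distilled_acc - self.baseline_acc


def best_teacher_post_only(results: List[DistillationResult], task: str) -> str:
    """Pick the teacher with the highest post-distillation accuracy,
    ignoring the student's baseline."""
    candidates = [r for r in results if r.task == task]
    return max(candidates, key=lambda r: r.distilled_acc).teacher


def best_teacher_vs_baseline(results: List[DistillationResult], task: str) -> Optional[str]:
    """Pick the teacher with the largest gain over the baseline.
    Returns None when no teacher improves on the undistilled student."""
    candidates = [r for r in results if r.task == task and r.gain > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.gain).teacher


# Placeholder numbers for illustration only; they are NOT results from the paper.
EXAMPLE_RESULTS = [
    DistillationResult("task_a", "teacher_large", baseline_acc=0.60, distilled_acc=0.55),
    DistillationResult("task_a", "teacher_medium", baseline_acc=0.60, distilled_acc=0.58),
    DistillationResult("task_b", "teacher_large", baseline_acc=0.40, distilled_acc=0.52),
    DistillationResult("task_b", "teacher_medium", baseline_acc=0.40, distilled_acc=0.47),
]

if __name__ == "__main__":
    for task in ("task_a", "task_b"):
        print(task,
              "post-only:", best_teacher_post_only(EXAMPLE_RESULTS, task),
              "| vs baseline:", best_teacher_vs_baseline(EXAMPLE_RESULTS, task))
```

With these placeholder values, the post-only comparison on `task_a` would favor the medium teacher, while the baseline-aware comparison returns `None` because both teachers leave the student below its starting accuracy. Returning `None` rather than the least harmful teacher is a design choice of this sketch: it makes the option of not distilling at all explicit.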
Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought Distillation
Capacity Gap
Teacher-Student Mismatch
Model Distillation
Performance Degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought Distillation
Capacity Gap
Model Distillation
Teacher-Student Framework
Evaluation Protocol