"The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework

📅 2026-01-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing chain-of-thought (CoT) distillation methods rely on a single teacher model, whose capability biases and possible catastrophic forgetting cap the student model's reasoning potential. This work proposes COMPACT, a multi-teacher distillation framework that dynamically weights supervision signals by how compatible they are with the student's evolving capabilities. Compatibility is assessed along three dimensions: graph-based consensus filters out erroneous reasoning paths, mutual information identifies informative teaching moments, and a loss-based difficulty measure gauges the student's receptivity in order to mitigate negative transfer. Experiments show that COMPACT achieves state-of-the-art performance across multiple reasoning benchmarks, improving small language models' reasoning abilities while alleviating catastrophic forgetting and preserving their original knowledge.

📝 Abstract

Chain-of-Thought (CoT) reasoning empowers Large Language Models (LLMs) with remarkable capabilities but typically requires prohibitive parameter scales. CoT distillation has emerged as a promising paradigm to transfer reasoning prowess into compact Student Models (SLMs), but existing approaches often rely on a solitary teacher, capping the student's potential since individual LLMs often exhibit distinct capability biases and may suffer from catastrophic forgetting. While leveraging diverse teachers seems appealing, effectively fusing their supervisions remains challenging: teacher-student incompatibility risks amplifying hallucinations, and passive supervision fails to ensure genuine logic internalization. To address this, we introduce COMPACT, a framework that adaptively fuses supervisions from different teachers by dynamically weighting teacher gradients based on the student's real-time compatibility evaluated by a multi-dimensional metric: (1) Graph-based Consensus to filter misleading rationales by identifying mainstream reasoning paths; (2) Mutual-Information-based Adaptability to detect "epiphany moments" for genuinely understanding the reasoning process rather than merely imitating; and (3) Loss-based Difficulty to assess student receptivity to the teacher's guidance and prevent negative transfer. Extensive experiments and latent space analysis demonstrate that COMPACT effectively integrates diverse reasoning capabilities without damaging the model's original knowledge structure, achieving state-of-the-art performance on various benchmarks while mitigating catastrophic forgetting.
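
The abstract does not give the formula that turns the three compatibility scores into teacher weights, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each teacher already has a scalar consensus, adaptability, and difficulty score, combines them with a hypothetical additive rule (consensus + adaptability - difficulty), and normalizes with a softmax. The function names, the combination rule, and the temperature parameter are all illustrative assumptions.

```python
import math


def compatibility_weights(consensus, adaptability, difficulty, temperature=1.0):
    """Return one normalized weight per teacher.

    ASSUMPTION: the additive scoring rule below is illustrative only;
    the source does not specify how the three scores are combined.
    Higher consensus and adaptability raise a teacher's weight,
    higher difficulty lowers it.
    """
    scores = [c + a - d for c, a, d in zip(consensus, adaptability, difficulty)]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fused_loss(teacher_losses, weights):
    """Weighted sum of per-teacher distillation losses."""
    return sum(w * loss for w, loss in zip(weights, teacher_losses))


# Hypothetical scores for three teachers on one training example.
weights = compatibility_weights(
    consensus=[0.9, 0.4, 0.7],
    adaptability=[0.5, 0.2, 0.3],
    difficulty=[0.3, 0.8, 0.2],
)
loss = fused_loss([1.2, 2.5, 1.0], weights)
```

Because the weights scale each teacher's loss term, they also scale that teacher's gradient contribution, which is the effect the abstract describes. In this toy input the second teacher (low consensus, high difficulty) receives the smallest weight.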

Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought Distillation

Multi-Teacher Learning

Teacher-Student Compatibility

Catastrophic Forgetting

Reasoning Capability Transfer

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought Distillation

Multi-Teacher Learning

Compatibility-Aware Fusion