"The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing chain-of-thought (CoT) distillation methods: they rely on a single teacher model, so the student inherits that teacher's capability biases and is exposed to catastrophic forgetting, which keeps it from reaching its full reasoning potential. To overcome this, the paper proposes COMPACT, a multi-teacher collaborative distillation framework that dynamically integrates the supervision signals that are compatible with the student’s evolving capabilities. COMPACT assesses compatibility along three dimensions: it filters erroneous reasoning paths via graph-based consensus, identifies informative teaching moments through mutual information, and mitigates negative transfer with a loss-based difficulty measure of how receptive the student is to each teacher's guidance. These three scores drive a dynamic weighting of the teachers' gradients. Experiments show that COMPACT achieves state-of-the-art performance across multiple reasoning benchmarks, improving small language models’ reasoning abilities while alleviating catastrophic forgetting and preserving their original knowledge.
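
The graph-based consensus idea can be illustrated with a toy sketch. The summary does not say how the graph is built or scored, so everything here is an assumption for illustration only: each teacher rationale is a node, edge weights are Jaccard token overlap, and a rationale's consensus score is its mean edge weight. The names `jaccard` and `consensus_scores` are hypothetical and are not taken from the paper.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two rationales (assumed edge weight)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def consensus_scores(rationales: list[str]) -> list[float]:
    """Score each rationale by its mean similarity to all other rationales.

    A rationale that agrees with the "mainstream" reasoning path has many
    strong edges and scores high; an outlier scores low.
    """
    n = len(rationales)
    if n == 0:
        return []
    if n == 1:
        return [1.0]  # a single teacher trivially agrees with itself
    scores = []
    for i in range(n):
        total = sum(
            jaccard(rationales[i], rationales[j]) for j in range(n) if j != i
        )
        scores.append(total / (n - 1))
    return scores


# Hypothetical rationales from three teachers for the same question.
teacher_rationales = [
    "add 3 and 4 to get 7 then double to get 14",
    "add 3 and 4 to get 7 then double it to get 14",
    "multiply 3 by 4 giving 12",
]
scores = consensus_scores(teacher_rationales)
```

In this toy example the third rationale shares few tokens with the other two, so it receives the lowest score and would be the one a consensus filter down-weights. A real system would likely compare reasoning steps semantically rather than by token overlap.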

📝 Abstract
Chain-of-Thought (CoT) reasoning empowers Large Language Models (LLMs) with remarkable capabilities but typically requires prohibitive parameter scales. CoT distillation has emerged as a promising paradigm to transfer reasoning prowess into compact Student Models (SLMs), but existing approaches often rely on a solitary teacher, capping the student's potential since individual LLMs often exhibit distinct capability biases and may suffer from catastrophic forgetting. While leveraging diverse teachers seems appealing, effectively fusing their supervision remains challenging: teacher-student incompatibility risks amplifying hallucinations, and passive supervision fails to ensure genuine logic internalization. To address this, we introduce COMPACT, a framework that adaptively fuses supervision from different teachers by dynamically weighting teacher gradients based on the student's real-time compatibility, evaluated by a multi-dimensional metric: (1) Graph-based Consensus to filter misleading rationales by identifying mainstream reasoning paths; (2) Mutual-Information-based Adaptability to detect "epiphany moments" for genuinely understanding the reasoning process rather than merely imitating; and (3) Loss-based Difficulty to assess student receptivity to the teacher's guidance and prevent negative transfer. Extensive experiments and latent space analysis demonstrate that COMPACT effectively integrates diverse reasoning capabilities without damaging the model's original knowledge structure, achieving state-of-the-art performance on various benchmarks while mitigating catastrophic forgetting.
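
The weighting step described in the abstract can be sketched as follows. The abstract does not give the combination rule, so this is a minimal sketch under stated assumptions: the three per-teacher scores are summed and passed through a softmax, and the fused loss is the weighted sum of per-teacher losses. The function names, the `temperature` parameter, and the convention that a higher `receptivity` score means the student is more receptive are all hypothetical.

```python
import math


def fusion_weights(consensus, adaptability, receptivity, temperature=1.0):
    """Turn three per-teacher scores into normalized fusion weights.

    Assumed rule (not from the paper): sum the scores for each teacher,
    then apply a temperature-scaled softmax across teachers.
    """
    if not (len(consensus) == len(adaptability) == len(receptivity)):
        raise ValueError("all score lists must have one entry per teacher")
    if not consensus:
        raise ValueError("at least one teacher is required")
    combined = [
        (c + a + r) / temperature
        for c, a, r in zip(consensus, adaptability, receptivity)
    ]
    peak = max(combined)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in combined]
    total = sum(exps)
    return [e / total for e in exps]


def fused_loss(per_teacher_losses, weights):
    """Weighted sum of per-teacher distillation losses."""
    return sum(w * l for w, l in zip(weights, per_teacher_losses))


# Hypothetical scores for two teachers on one training example.
weights = fusion_weights(
    consensus=[0.9, 0.2],
    adaptability=[0.8, 0.1],
    receptivity=[0.7, 0.3],
)
loss = fused_loss([1.0, 3.0], weights)
```

Because the gradient is linear, treating the weights as constants and weighting the scalar losses scales each teacher's gradient by the same weight, which is one simple way to realize "weighting teacher gradients". Here the first teacher scores higher on all three dimensions, so it receives the larger weight.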
Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought Distillation
Multi-Teacher Learning
Teacher-Student Compatibility
Catastrophic Forgetting
Reasoning Capability Transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought Distillation
Multi-Teacher Learning
Compatibility-Aware Fusion
Catastrophic Forgetting Mitigation
Adaptive Gradient Weighting
Authors

Jin Cui
Principal Engineer
Embedded System, OS Kernel & Driver, Hypervisor & Virtualization, Computer uArch modelling, FPGA & EDA
Jiaqi Guo
Nankai University
Jiepeng Zhou
The Hong Kong University of Science and Technology (Guangzhou)
Ruixuan Yang
State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Jiayi Lu
Beihang University
Autonomous Vehicle, Computer Vision, SOTIF, ADAS
Jiajun Xu
School of Software Engineering, Xi’an Jiaotong University
Jiangcheng Song
State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Boran Zhao
School of Software Engineering, Xi’an Jiaotong University
Pengju Ren
Professor, Xi'an Jiaotong University