🤖 AI Summary
This work addresses a limitation of existing chain-of-thought (CoT) distillation methods: they rely on a single teacher model and therefore inherit its capability biases and its susceptibility to catastrophic forgetting, which keeps student models from reaching their full reasoning potential. To overcome this, we propose COMPACT, a multi-teacher collaborative distillation framework that dynamically integrates supervision signals compatible with the student’s evolving capabilities. COMPACT introduces a multidimensional compatibility assessment mechanism: it filters erroneous reasoning paths via graph-based consensus, identifies informative teaching moments through mutual information, and mitigates negative transfer by using a loss-based difficulty measure to gauge how receptive the student is to each teacher’s guidance. By combining graph-structured analysis, mutual information estimation, and dynamic gradient weighting, COMPACT constructs a compatibility-aware distillation system. Experiments demonstrate that COMPACT achieves state-of-the-art performance across multiple reasoning benchmarks, enhancing small language models’ reasoning abilities while alleviating catastrophic forgetting and preserving their original knowledge.
📝 Abstract
Chain-of-Thought (CoT) reasoning empowers Large Language Models (LLMs) with remarkable capabilities but typically requires prohibitive parameter scales. CoT distillation has emerged as a promising paradigm to transfer reasoning prowess into compact Student Models (SLMs), but existing approaches often rely on a single teacher, which caps the student's potential because individual LLMs exhibit distinct capability biases and may suffer from catastrophic forgetting. While leveraging diverse teachers seems appealing, effectively fusing their supervision remains challenging: teacher-student incompatibility risks amplifying hallucinations, and passive supervision fails to ensure genuine logic internalization. To address this, we introduce COMPACT, a framework that adaptively fuses supervision from different teachers by dynamically weighting teacher gradients according to the student's real-time compatibility, evaluated by a multi-dimensional metric: (1) Graph-based Consensus, to filter misleading rationales by identifying mainstream reasoning paths; (2) Mutual-Information-based Adaptability, to detect "epiphany moments" where the student genuinely understands the reasoning process rather than merely imitating it; and (3) Loss-based Difficulty, to assess student receptivity to the teacher's guidance and prevent negative transfer. Extensive experiments and latent space analysis demonstrate that COMPACT effectively integrates diverse reasoning capabilities without damaging the model's original knowledge structure, achieving state-of-the-art performance on various benchmarks while mitigating catastrophic forgetting.
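
The compatibility-based weighting described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how the three scores are combined or normalized, so the simple sum followed by a softmax, the `temperature` parameter, and all function names here are hypothetical. Each score is assumed to be oriented so that a higher value means the teacher is more compatible with the student.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compatibility_weights(consensus, adaptability, difficulty, temperature=1.0):
    """Turn per-teacher scores into weights that sum to 1.

    Assumption: the three dimensions are summed with equal importance.
    The source names the dimensions but not the combination rule.
    """
    combined = [c + a + d for c, a, d in zip(consensus, adaptability, difficulty)]
    return softmax(combined, temperature)


def fused_loss(teacher_losses, weights):
    """Weighted sum of per-teacher distillation losses.

    If the weights are treated as constants during backpropagation,
    weighting the losses is equivalent to weighting each teacher's gradient.
    """
    return sum(w * loss for w, loss in zip(weights, teacher_losses))


# Two hypothetical teachers; the first scores higher on consensus and difficulty.
weights = compatibility_weights(
    consensus=[0.9, 0.2],
    adaptability=[0.5, 0.5],
    difficulty=[0.7, 0.1],
)
loss = fused_loss([1.2, 2.5], weights)
```

In this sketch the first teacher receives the larger weight, so its rationale dominates the student's update for that sample. A lower `temperature` would sharpen the preference toward the most compatible teacher, while a higher one would move the weights toward a uniform average.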