MST-Distill: Mixture of Specialized Teachers for Cross-Modal Knowledge Distillation

📅 2025-07-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address insufficient exploitation of complementary prior knowledge, suboptimal distillation path selection, and knowledge drift—arising from data and statistical heterogeneity in cross-modal knowledge distillation—this paper proposes a dynamic adaptive distillation framework. Methodologically, it introduces (1) an instance-level routing network that automatically selects the optimal teacher modality combination per sample, and (2) a plug-and-play masking module, trained independently to suppress modality-specific discrepancies and reconstruct teacher representations. The framework enables flexible integration of both cross-modal and multi-modal teacher models. Evaluated on five benchmark datasets spanning vision, audio, and text modalities, it consistently outperforms state-of-the-art methods, demonstrating superior knowledge transfer efficiency and representation consistency.
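
A minimal sketch of the routing idea, assuming a softmax gate over per-teacher KL distillation losses; the names InstanceRouter and routed_kd_loss, the gate architecture, and the temperature value are illustrative assumptions, not the paper's released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceRouter(nn.Module):
    """Hypothetical instance-level gate: one weight per teacher, per sample."""
    def __init__(self, feat_dim: int, num_teachers: int, hidden: int = 128):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_teachers),
        )

    def forward(self, student_feat: torch.Tensor) -> torch.Tensor:
        # (batch, num_teachers): a routing distribution for every sample
        return F.softmax(self.gate(student_feat), dim=-1)


def routed_kd_loss(student_logits, teacher_logits_list, route_weights, T=4.0):
    """Per-sample weighted sum of KL distillation terms over all teachers."""
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    loss = student_logits.new_zeros(())
    for k, t_logits in enumerate(teacher_logits_list):
        p_t = F.softmax(t_logits.detach() / T, dim=-1)
        kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=-1)  # (batch,)
        loss = loss + (route_weights[:, k] * kl).mean()
    return (T * T) * loss
```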

📝 Abstract
Knowledge distillation, as an efficient knowledge transfer technique, has achieved remarkable success in unimodal scenarios. However, in cross-modal settings, conventional distillation methods encounter significant challenges due to data and statistical heterogeneities, failing to leverage the complementary prior knowledge embedded in cross-modal teacher models. This paper empirically reveals two critical issues in existing approaches: distillation path selection and knowledge drift. To address these limitations, we propose MST-Distill, a novel cross-modal knowledge distillation framework featuring a mixture of specialized teachers. Our approach employs a diverse ensemble of teacher models across both cross-modal and multimodal configurations, integrated with an instance-level routing network that facilitates adaptive and dynamic distillation. This architecture effectively transcends the constraints of traditional methods that rely on monotonous and static teacher models. Additionally, we introduce a plug-in masking module, independently trained to suppress modality-specific discrepancies and reconstruct teacher representations, thereby mitigating knowledge drift and enhancing transfer effectiveness. Extensive experiments across five diverse multimodal datasets, spanning visual, audio, and textual modalities, demonstrate that our method significantly outperforms existing state-of-the-art knowledge distillation methods in cross-modal distillation tasks. The source code is available at https://github.com/Gray-OREO/MST-Distill.
Problem

Research questions and friction points this paper is trying to address.

Addresses data and statistical heterogeneity in cross-modal knowledge distillation
Tackles suboptimal distillation path selection and knowledge drift
Improves knowledge transfer via a mixture of specialized teachers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of specialized teachers for cross-modal distillation
Instance-level routing network for adaptive distillation
Plug-in masking module to suppress modality discrepancies (sketched below)
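
A rough sketch of the masking idea, assuming a learnable per-channel sigmoid gate applied to a teacher's features before they serve as the distillation target; the name FeatureMask and the gating design are assumptions for illustration, not the authors' actual module:

```python
import torch
import torch.nn as nn

class FeatureMask(nn.Module):
    """Hypothetical plug-in mask: soft per-channel gates applied to a
    teacher's feature vector to suppress modality-specific components."""
    def __init__(self, feat_dim: int):
        super().__init__()
        # One learnable logit per channel; sigmoid keeps each gate in (0, 1).
        self.mask_logits = nn.Parameter(torch.zeros(feat_dim))

    def forward(self, teacher_feat: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.mask_logits)
        return teacher_feat * gate  # masked teacher representation


# Usage sketch: the mask is trained independently of the student, then
# plugged in front of the feature-distillation loss so the student matches
# the masked (less modality-specific) teacher representation.
mask = FeatureMask(feat_dim=512)
teacher_feat = torch.randn(8, 512)
masked_feat = mask(teacher_feat)  # target for the student's feature loss
```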

Authors

Hui Li, Xidian University, Xi’an, China
Pengfei Yang, Institute of Software, Chinese Academy of Sciences
Juanyang Chen, Xidian University, Xi’an, China
Le Dong, Xidian University, Xi’an, China
Yanxin Chen, Xidian University, Xi’an, China
Quan Wang, Xidian University, Xi’an, China