Cross-Modal Distillation For Widely Differing Modalities

📅 2025-07-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address overfitting in knowledge distillation caused by modality heterogeneity in multimodal learning, this paper proposes a teacher-student framework tailored for discriminative cross-modal knowledge transfer. Methodologically, it integrates cross-modal knowledge distillation, joint feature-classifier alignment, and dynamic sample weighting—without requiring strong inter-modal alignment assumptions. Its key contributions are: (1) a two-level soft-constraint distillation strategy that jointly aligns heterogeneous modalities in both feature space and classifier output space; and (2) a data-quality-aware adaptive sample weighting mechanism to enhance model robustness. Evaluated on speaker recognition and image classification tasks, the method significantly improves cross-modal knowledge transfer efficiency and generalization across vision, language, and speech modalities. Notably, it demonstrates superior robustness on low-quality samples, validating its effectiveness under realistic, noisy conditions.
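The two-level soft-constraint idea can be made concrete with a short sketch. Below is a minimal PyTorch illustration, assuming a cosine-similarity objective at the feature level and a temperature-scaled KL divergence at the classifier level; the function names, loss choices, and temperature value are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def soft_feature_loss(student_feat: torch.Tensor,
                      teacher_feat: torch.Tensor) -> torch.Tensor:
    # Feature-level soft constraint: align feature *directions* via cosine
    # similarity instead of forcing exact equality with a hard l2 loss,
    # which the paper argues overfits across widely differing modalities.
    s = F.normalize(student_feat, dim=-1)
    t = F.normalize(teacher_feat, dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()

def soft_classifier_loss(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         temperature: float = 4.0) -> torch.Tensor:
    # Classifier-level soft constraint: match softened class distributions
    # (standard KD-style KL divergence with a temperature).
    log_p_s = F.log_softmax(student_logits / temperature, dim=-1)
    p_t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * temperature ** 2
```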

📝 Abstract
Deep learning has achieved great progress recently; however, it is neither easy nor efficient to further improve performance simply by increasing model size. Multi-modal learning can mitigate this challenge by introducing richer and more discriminative information as input. To address the limited availability of multi-modal data at inference time, we conduct multi-modal learning by introducing a teacher model that transfers discriminative knowledge to a student model during training. However, this knowledge transfer via distillation is not trivial, because the large domain gap between widely differing modalities can easily lead to overfitting. In this work, we introduce a cross-modal distillation framework. Specifically, we find that a hard-constrained loss, e.g., an l2 loss forcing the student to be exactly the same as the teacher, can easily lead to overfitting in cross-modal distillation. To address this, we propose two soft-constrained knowledge distillation strategies, at the feature level and the classifier level respectively. In addition, we propose a quality-based adaptive weights module that weights input samples by quantified data quality, leading to robust model training. We conducted experiments on speaker recognition and image classification tasks, and the results show that our approach effectively transfers knowledge between the commonly used yet widely differing modalities of image, text, and speech.
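The quality-based adaptive weights module can be sketched as follows, assuming each input sample carries a scalar quality score (e.g., an estimated signal-to-noise ratio for speech, or a sharpness measure for images) that is mapped to a per-sample loss weight. The softmax mapping below is an illustrative assumption; the paper's exact quality quantification and weighting scheme may differ.

```python
import torch

def quality_weighted_loss(per_sample_loss: torch.Tensor,
                          quality_scores: torch.Tensor) -> torch.Tensor:
    """Reduce a batch of unreduced losses with quality-aware weights.

    per_sample_loss: shape (B,), one loss value per sample.
    quality_scores:  shape (B,), quantified data quality (higher = cleaner).
    """
    # Softmax turns quality scores into weights that sum to 1, so cleaner
    # samples dominate the batch loss and noisy samples are down-weighted.
    weights = torch.softmax(quality_scores, dim=0)
    return (weights * per_sample_loss).sum()
```

Soft down-weighting, as opposed to hard filtering, keeps noisy samples in the training signal while limiting their influence, which matches the paper's emphasis on robustness under realistic, low-quality conditions.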
Problem

Research questions and friction points this paper is trying to address.

Transfer knowledge between widely differing modalities
Prevent overfitting in cross-modal distillation
Improve robustness via quality-based adaptive weights
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal distillation for differing modalities
Soft constrained knowledge distillation strategies
Quality-based adaptive weights for robust training (see the combined sketch below)
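Putting the three pieces together, here is a hedged sketch of one plausible combined training objective: a per-sample task loss plus the two soft distillation constraints, reduced with quality-based weights. The weights alpha and beta and the temperature are illustrative hyperparameters, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def combined_objective(student_feat, teacher_feat,
                       student_logits, teacher_logits,
                       labels, quality_scores,
                       alpha=1.0, beta=1.0, temperature=4.0):
    # Per-sample supervised task loss on the student's own predictions.
    ce = F.cross_entropy(student_logits, labels, reduction="none")

    # Feature-level soft constraint (per-sample cosine distance).
    s = F.normalize(student_feat, dim=-1)
    t = F.normalize(teacher_feat, dim=-1)
    feat = 1.0 - (s * t).sum(dim=-1)

    # Classifier-level soft constraint (per-sample temperature-scaled KL).
    log_p_s = F.log_softmax(student_logits / temperature, dim=-1)
    p_t = F.softmax(teacher_logits / temperature, dim=-1)
    kd = (p_t * (p_t.clamp_min(1e-8).log() - log_p_s)).sum(dim=-1) \
         * temperature ** 2

    # Quality-based adaptive weighting over the batch.
    weights = torch.softmax(quality_scores, dim=0)
    return (weights * (ce + alpha * feat + beta * kd)).sum()
```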
👥 Authors

Cairong Zhao
Tongji University
deep learning, computer vision, person re-ID

Yufeng Jin
Department of Computer Science & Technology, Tongji University, Shanghai 201804, China

Zifan Song
Tongji University
Multimodal Learning, Data-centric AI, Large Language Models

Haonan Chen
Alibaba Group, Hangzhou 310000, China

Duoqian Miao
Department of Computer Science & Technology, Tongji University, Shanghai 201804, China

Guosheng Hu
Oosto, Belfast BT1 2BE, UK