Unifying Biomedical Vision-Language Expertise: Towards a Generalist Foundation Model via Multi-CLIP Knowledge Distillation

📅 2025-06-27
🤖 AI Summary
Biomedical image-text corpora suffer from scarcity, strong modality heterogeneity, and fragmented data standards. Method: We propose MMKD-CLIP, a multi-teacher knowledge distillation framework that fuses feature-level knowledge from nine domain-specific biomedical CLIP models. Training proceeds in two stages: (i) CLIP-style pretraining on 2.9M image-text pairs spanning 26 modalities, followed by (ii) feature-level knowledge distillation using 19.2M teacher-generated feature pairs. Contribution/Results: Evaluated on 58 datasets covering nine imaging modalities and 10.8M images, MMKD-CLIP consistently outperforms all individual teacher models across six downstream tasks. It markedly improves generalization across institutions, modalities, and tasks, establishing a scalable paradigm for universal biomedical vision-language foundation models.

📝 Abstract
CLIP models pretrained on natural images with billion-scale image-text pairs have demonstrated impressive capabilities in zero-shot classification, cross-modal retrieval, and open-ended visual question answering. However, transferring this success to biomedicine is hindered by the scarcity of large-scale biomedical image-text corpora, the heterogeneity of image modalities, and fragmented data standards across institutions. These limitations impede the development of a unified and generalizable biomedical foundation model trained from scratch. To overcome this, we introduce MMKD-CLIP, a generalist biomedical foundation model developed via Multiple Medical CLIP Knowledge Distillation. Rather than relying on billion-scale raw data, MMKD-CLIP distills knowledge from nine state-of-the-art domain-specific or generalist biomedical CLIP models, each pretrained on millions of biomedical image-text pairs. Our two-stage training pipeline first performs CLIP-style pretraining on over 2.9 million biomedical image-text pairs from 26 image modalities, followed by feature-level distillation using over 19.2 million feature pairs extracted from teacher models. We evaluate MMKD-CLIP on 58 diverse biomedical datasets, encompassing over 10.8 million biomedical images across nine image modalities. The evaluation spans six core task types: zero-shot classification, linear probing, cross-modal retrieval, visual question answering, survival prediction, and cancer diagnosis. MMKD-CLIP consistently outperforms all teacher models while demonstrating remarkable robustness and generalization across image domains and task settings. These results underscore that multi-teacher knowledge distillation is a scalable and effective paradigm for building high-performing biomedical foundation models under the practical constraints of real-world data availability.
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale biomedical image-text corpora
Heterogeneity of biomedical image modalities
Fragmented data standards across institutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-CLIP knowledge distillation for biomedical model
Two-stage training with pretraining and feature distillation
Utilizes 19.2M feature pairs from nine teacher models
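The core of the second stage — pulling a student encoder's features toward the features produced by multiple pretrained teachers — can be illustrated with a minimal sketch. This is a hypothetical simplification (a cosine-distance objective averaged over teachers, on precomputed features); the paper's exact loss, weighting, and architecture may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project feature vectors onto the unit sphere, as in CLIP."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def multi_teacher_distill_loss(student_feats, teacher_feats_list):
    """Feature-level distillation loss: mean cosine distance between the
    student's embeddings and each teacher's embeddings for the same batch
    of images, averaged over all teachers (hypothetical formulation)."""
    s = l2_normalize(student_feats)
    per_teacher = []
    for t in teacher_feats_list:
        t = l2_normalize(t)
        # 1 - cosine similarity, averaged over the batch
        per_teacher.append(np.mean(1.0 - np.sum(s * t, axis=-1)))
    return float(np.mean(per_teacher))

# Toy example: a batch of 4 images, 512-d features, nine teachers
rng = np.random.default_rng(0)
student = rng.standard_normal((4, 512))
teachers = [rng.standard_normal((4, 512)) for _ in range(9)]
loss = multi_teacher_distill_loss(student, teachers)
print(loss)
```

The loss is zero when the student reproduces every teacher's (normalized) features exactly, and grows as the student's embeddings rotate away from the teachers'; in practice such a term would be minimized alongside the stage-one contrastive objective.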
Shansong Wang
Postdoctoral Research Fellow at Emory University
computer vision, multimodal learning, foundation model
Zhecheng Jin
Department of Biomedical Engineering, College of Engineering, Georgia Institute of Technology
Mingzhe Hu
Department of Radiation Oncology, Winship Cancer Institute, Emory University School of Medicine; Department of Computer Science and Mathematics, Laney Graduate School, Emory University
Mojtaba Safari
Postdoctoral Fellow, Emory University
Medical Physics, MRI, Medical Image Analysis
Feng Zhao
School of Electrical and Computer Engineering, College of Engineering, Georgia Institute of Technology
Chih-Wei Chang
Emory University
Physics-Informed Machine Learning, Digital Twins, GAI, Proton Therapy, Radiotherapy
Richard LJ Qiu
Department of Radiation Oncology, Winship Cancer Institute, Emory University School of Medicine
Justin Roper
Department of Radiation Oncology, Winship Cancer Institute, Emory University School of Medicine
David S. Yu
Department of Radiation Oncology, Winship Cancer Institute, Emory University School of Medicine
Xiaofeng Yang
Department of Radiation Oncology, Winship Cancer Institute, Emory University School of Medicine; Department of Biomedical Engineering, College of Engineering, Georgia Institute of Technology; Department of Computer Science and Mathematics, Laney Graduate School, Emory University