ConceptCLIP: Towards Trustworthy Medical AI via Concept-Enhanced Contrastive Language-Image Pre-training

📅 2025-01-26
📈 Citations: 0 · Influential: 0
🤖 AI Summary
To address the longstanding trade-off between accuracy and interpretability in medical AI, this paper introduces the first vision-language pre-training framework designed specifically for trustworthy medical AI. Methodologically, it proposes a dual-alignment paradigm, jointly optimizing image-text alignment and image-patch to medical-concept alignment, grounded in the Unified Medical Language System (UMLS). The authors construct MedConcept-23M, a large-scale, UMLS-enhanced multimodal dataset comprising 23 million medical image-text pairs, and integrate medical ontology knowledge via fine-grained Patch-Concept Alignment (PC-Align) and multimodal Image-Text Alignment (IT-Align). Evaluated across 51 downstream tasks spanning 10 imaging modalities and five clinical categories, the model consistently outperforms state-of-the-art baselines. Notably, concept localization accuracy improves by an average of 12.7% across six modalities, substantially enhancing decision interpretability and clinical trustworthiness.

📝 Abstract
Trustworthiness is essential for the precise and interpretable application of artificial intelligence (AI) in medical imaging. Traditionally, precision and interpretability have been addressed as separate tasks, namely medical image analysis and explainable AI, each developing its own models independently. In this study, for the first time, we investigate the development of a unified medical vision-language pre-training model that can achieve both accurate analysis and interpretable understanding of medical images across various modalities. To build the model, we construct MedConcept-23M, a large-scale dataset comprising 23 million medical image-text pairs extracted from 6.2 million scientific articles, enriched with concepts from the Unified Medical Language System (UMLS). Based on MedConcept-23M, we introduce ConceptCLIP, a medical AI model utilizing concept-enhanced contrastive language-image pre-training. The pre-training of ConceptCLIP involves two primary components: image-text alignment learning (IT-Align) and patch-concept alignment learning (PC-Align). This dual alignment strategy enhances the model's capability to associate specific image regions with relevant concepts, thereby improving both the precision of analysis and the interpretability of the AI system. We conducted extensive experiments on 5 diverse types of medical image analysis tasks, spanning 51 subtasks across 10 image modalities, the broadest range of downstream tasks evaluated to date. The results demonstrate the effectiveness of the proposed vision-language pre-training model. Further explainability analysis across 6 modalities reveals that ConceptCLIP achieves superior performance, underscoring its robust ability to advance explainable AI in medical imaging. These findings highlight ConceptCLIP's capability in promoting trustworthy AI in the field of medicine.
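The dual objective described above lends itself to a compact sketch. Below is a minimal, illustrative PyTorch rendering of a CLIP-style IT-Align loss paired with a patch-concept alignment term. The function names, tensor shapes, attention-based concept grounding, and the `lambda_pc` weighting are all assumptions made for exposition, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings a, b of shape (N, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature  # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Matching pairs sit on the diagonal; contrast in both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def dual_alignment_loss(img_emb, txt_emb, patch_emb, concept_emb,
                        lambda_pc=1.0):
    """Sketch of a dual-alignment objective in the spirit of IT-Align + PC-Align.

    img_emb:     (N, D)    global image embeddings
    txt_emb:     (N, D)    report/caption embeddings
    patch_emb:   (N, P, D) image-patch embeddings
    concept_emb: (N, C, D) UMLS-concept embeddings per sample
    All shapes and the pooling choice below are illustrative assumptions.
    """
    # IT-Align: standard image-text contrastive loss.
    loss_it = info_nce(img_emb, txt_emb)

    # PC-Align (sketch): softly attend each concept over the patches,
    # then contrast the pooled concept-grounded representation per sample.
    patches = F.normalize(patch_emb, dim=-1)
    concepts = F.normalize(concept_emb, dim=-1)
    sim = torch.einsum('npd,ncd->ncp', patches, concepts)   # (N, C, P)
    attn = sim.softmax(dim=-1)                               # concept-to-patch attention
    grounded = torch.einsum('ncp,npd->ncd', attn, patches)   # concept-grounded features
    loss_pc = info_nce(grounded.mean(dim=1), concepts.mean(dim=1))

    return loss_it + lambda_pc * loss_pc
```

In this reading, IT-Align supplies the global supervision that standard CLIP-style models use, while the PC-Align term is what ties individual image regions to named UMLS concepts, which is the mechanism the paper credits for the improved concept localization and interpretability.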
Problem

Research questions and friction points this paper is trying to address.

Medical Image Analysis · Artificial Intelligence · Explainability

Innovation

Methods, ideas, or system contributions that make the work stand out.

ConceptCLIP · Medical Image Analysis · Explainable AI
👥 Authors
Yuxiang Nie
Hong Kong University of Science and Technology
Natural Language Processing · Multi-modal Learning · Medical Image Analysis
Sunan He
Hong Kong University of Science and Technology
Multi-Modal Learning
Yequan Bie
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China.
Yihui Wang
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China.
Zhixuan Chen
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China.
Shu Yang
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China.
Hao Chen
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China.; Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Hong Kong, China.; Division of Life Science, The Hong Kong University of Science and Technology, Hong Kong, China.; State Key Laboratory of Molecular Neuroscience, The Hong Kong University of Science and Technology, Hong Kong, China.; Shenzhen-Hong Kong Collaborative Innovation Research In