🤖 AI Summary
To address the longstanding trade-off between accuracy and interpretability in medical AI, this paper introduces ConceptCLIP, presented as the first unified vision-language pre-training framework aimed at trustworthy medical AI. Methodologically, it proposes a dual-alignment paradigm, grounded in the Unified Medical Language System (UMLS), that jointly optimizes image-text alignment (IT-Align) and fine-grained alignment between image patches and medical concepts (PC-Align). The authors construct MedConcept-23M, a large-scale, UMLS-enhanced multimodal dataset comprising 23 million medical image-text pairs, and use it to inject medical ontology knowledge during pre-training. Evaluated on 51 downstream tasks spanning 10 imaging modalities and five task types, the model consistently outperforms state-of-the-art baselines. Notably, concept localization accuracy improves by an average of 12.7% across six modalities, substantially enhancing decision interpretability and clinical trustworthiness.
📝 Abstract
Trustworthy artificial intelligence (AI) in medical imaging must be both precise and interpretable. Traditionally, precision and interpretability have been addressed as separate tasks, namely medical image analysis and explainable AI, each with its own independently developed models. In this study, for the first time, we investigate a unified medical vision-language pre-training model that achieves both accurate analysis and interpretable understanding of medical images across various modalities. To build the model, we construct MedConcept-23M, a large-scale dataset comprising 23 million medical image-text pairs extracted from 6.2 million scientific articles and enriched with concepts from the Unified Medical Language System (UMLS). Based on MedConcept-23M, we introduce ConceptCLIP, a medical AI model that uses concept-enhanced contrastive language-image pre-training. The pre-training of ConceptCLIP involves two primary components: image-text alignment learning (IT-Align) and patch-concept alignment learning (PC-Align). This dual alignment strategy strengthens the model's ability to associate specific image regions with relevant concepts, thereby improving both the precision of analysis and the interpretability of the AI system. We conduct extensive experiments on 5 types of medical image analysis tasks, spanning 51 subtasks across 10 image modalities and covering the broadest range of downstream tasks. The results demonstrate the effectiveness of the proposed vision-language pre-training model. Further explainability analysis across 6 modalities shows that ConceptCLIP achieves superior performance, underscoring its ability to advance explainable AI in medical imaging. These findings highlight ConceptCLIP's potential to promote trustworthy AI in medicine.
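
The dual alignment strategy (IT-Align plus PC-Align) can be illustrated with a minimal NumPy sketch. The abstract does not specify the loss functions, so everything below is an assumption made for illustration: a CLIP-style symmetric InfoNCE loss stands in for IT-Align, a max-over-patches similarity scored with binary cross-entropy stands in for PC-Align, and the names `it_align_loss`, `pc_align_loss`, `dual_alignment_loss`, `localize_concept`, the `temperature` value, and the weight `lam` are all hypothetical. It is not ConceptCLIP's actual implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def it_align_loss(img_emb, txt_emb, temperature=0.07):
    """Assumed IT-Align stand-in: symmetric InfoNCE over paired (B, D) embeddings."""
    img, txt = l2_normalize(img_emb), l2_normalize(txt_emb)
    logits = img @ txt.T / temperature          # (B, B) image-to-text similarities
    idx = np.arange(logits.shape[0])            # matching pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_i2t + loss_t2i))


def pc_align_loss(patch_emb, concept_emb, concept_mask, temperature=0.07):
    """Assumed PC-Align stand-in.

    patch_emb:    (B, P, D) patch embeddings per image
    concept_emb:  (C, D) embeddings of (e.g. UMLS) concepts
    concept_mask: (B, C), 1 where the concept appears in the image's text
    """
    patches, concepts = l2_normalize(patch_emb), l2_normalize(concept_emb)
    sim = np.einsum("bpd,cd->bpc", patches, concepts)   # patch-concept cosine similarity
    score = sim.max(axis=1) / temperature               # best-matching patch per concept
    # Binary cross-entropy with logits: log(1 + e^x) - y * x
    bce = np.logaddexp(0.0, score) - concept_mask * score
    return float(bce.mean())


def dual_alignment_loss(img_emb, txt_emb, patch_emb, concept_emb, concept_mask, lam=1.0):
    """Weighted sum of the two assumed objectives; `lam` is a hypothetical weight."""
    return it_align_loss(img_emb, txt_emb) + lam * pc_align_loss(
        patch_emb, concept_emb, concept_mask
    )


def localize_concept(patch_emb, concept_vec):
    """Index of the patch (in a (P, D) array) most similar to one concept vector."""
    sims = l2_normalize(patch_emb) @ l2_normalize(concept_vec)
    return int(np.argmax(sims))
```

In this sketch, the same patch-concept similarity that drives the second loss term also yields a localization signal through `localize_concept`, which is one plausible reading of how aligning image regions with concepts could support interpretability.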