🤖 AI Summary
Current AI-based medical models predominantly rely on implicit, parameterized knowledge encoding, limiting their adaptability to diverse downstream diagnostic tasks. To address this, we propose RAD (Retrieval-Augmented Diagnosis), a novel framework featuring: (1) task-oriented retrieval and refinement of external medical knowledge; (2) guideline-driven cross-modal fusion, wherein clinical practice guidelines serve as structured queries to jointly process imaging and textual inputs; and (3) a guideline-enhanced contrastive loss coupled with a dual Transformer decoder to ensure clinical alignment from knowledge retrieval to decision generation. Furthermore, we introduce the first quantitative, interpretability-aware evaluation benchmark specifically designed for multimodal diagnostic models. Evaluated across four datasets spanning different anatomical sites, RAD achieves state-of-the-art performance and significantly improves attention to pathological regions and critical clinical indicators, enabling evidence-traceable, interpretable, and trustworthy diagnosis.
📝 Abstract
Clinical diagnosis is a highly specialized discipline requiring both domain expertise and strict adherence to rigorous guidelines. Current AI-driven medical research predominantly relies on knowledge graphs or natural-text pretraining paradigms to incorporate medical knowledge, but these approaches encode knowledge implicitly within model parameters and neglect the task-specific knowledge required by diverse downstream tasks. To address this limitation, we propose Retrieval-Augmented Diagnosis (RAD), a novel framework that explicitly injects external knowledge into multimodal models directly at the downstream-task level. Specifically, RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources; a guideline-enhanced contrastive loss that constrains the latent distance between multimodal features and guideline knowledge; and a dual Transformer decoder that employs guidelines as queries to steer cross-modal fusion. Together, these mechanisms align the model with the clinical diagnostic workflow, from guideline acquisition through feature extraction to decision-making. Moreover, recognizing the lack of quantitative interpretability evaluation for multimodal diagnostic models, we introduce a set of criteria to assess interpretability from both image and text perspectives. Extensive evaluations across four datasets covering different anatomies demonstrate RAD's generalizability and state-of-the-art performance. Furthermore, RAD enables the model to concentrate more precisely on abnormal regions and critical indicators, supporting evidence-based, trustworthy diagnosis. Our code is available at https://github.com/tdlhl/RAD.
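The abstract does not specify the form of the guideline-enhanced contrastive loss; as an illustration only, the sketch below shows one plausible InfoNCE-style formulation in PyTorch that constrains the latent distance between fused multimodal features and guideline embeddings. All names (`guideline_contrastive_loss`, `z_fused`, `z_guide`) are hypothetical, not taken from the paper or its repository.

```python
import torch
import torch.nn.functional as F

def guideline_contrastive_loss(z_fused, z_guide, temperature=0.07):
    """Illustrative (hypothetical) InfoNCE-style loss: pull each sample's
    fused image-text feature toward the embedding of its matching guideline
    entry, and push it away from the guidelines of other batch samples.

    z_fused: (B, D) fused multimodal features
    z_guide: (B, D) guideline-knowledge embeddings, row i matching sample i
    """
    z_fused = F.normalize(z_fused, dim=-1)
    z_guide = F.normalize(z_guide, dim=-1)
    logits = z_fused @ z_guide.t() / temperature          # (B, B) similarities
    targets = torch.arange(z_fused.size(0), device=z_fused.device)
    # Symmetric cross-entropy over both matching directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```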
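Likewise, the guideline-as-query fusion can be pictured as cross-attention in which guideline embeddings attend over a modality's tokens; a minimal sketch using standard PyTorch decoder layers follows, again with hypothetical names and shapes rather than the paper's actual architecture.

```python
import torch
import torch.nn as nn

class GuidelineQueryDecoder(nn.Module):
    """One branch of a (hypothetical) dual-decoder setup: guideline
    embeddings act as queries, and one modality's token sequence (image
    patches or report tokens) supplies keys/values via cross-attention."""

    def __init__(self, d_model=256, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, guideline_queries, modality_tokens):
        # tgt = guideline queries, memory = modality tokens
        return self.decoder(tgt=guideline_queries, memory=modality_tokens)

# Hypothetical usage: one decoder per modality, outputs combined for diagnosis
img_decoder, txt_decoder = GuidelineQueryDecoder(), GuidelineQueryDecoder()
guide = torch.randn(4, 10, 256)        # (batch, guideline entries, dim)
img_tokens = torch.randn(4, 196, 256)  # image patch tokens
txt_tokens = torch.randn(4, 64, 256)   # report text tokens
fused = img_decoder(guide, img_tokens) + txt_decoder(guide, txt_tokens)
```

Using the guidelines as the query side means the fused representation is organized around clinically mandated criteria rather than raw input saliency, which is consistent with the evidence-traceability claim above.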