VisTA: Vision-Text Alignment Model with Contrastive Learning using Multimodal Data for Evidence-Driven, Reliable, and Explainable Alzheimer's Disease Diagnosis

📅 2025-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the tension between model interpretability and few-shot generalization in clinical Alzheimer's disease (AD) diagnosis, this paper proposes VisTA, a vision–text alignment model built on the BiomedCLIP architecture. VisTA is fine-tuned with multimodal contrastive learning on only 170 expert-verified annotated samples, semantically aligning neuroimaging data with expert-generated abnormality descriptions and supporting evidence-traceable reasoning. Experiments demonstrate substantial improvements over the baseline: abnormality retrieval accuracy increases by 48 percentage points to 74%, with AUC rising to 0.87 (+0.13); dementia prediction accuracy reaches 88% (+58 pp), with AUC of 0.82 (+0.25); and the generated explanations agree strongly with expert judgments. To the authors' knowledge, this is the first work to simultaneously achieve high diagnostic accuracy, strong interpretability, and verifiability for AD in a few-shot medical setting.
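
The alignment objective the summary describes is a standard CLIP-style symmetric contrastive loss over matched (scan, description) pairs. Below is a minimal sketch of what such fine-tuning could look like, assuming the `open_clip` loading interface shown on the BiomedCLIP model card; the `contrastive_step` helper, batch construction, and learning rate are illustrative placeholders, not the authors' released code.

```python
import torch
import torch.nn.functional as F
from open_clip import create_model_from_pretrained, get_tokenizer

HUB_ID = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = create_model_from_pretrained(HUB_ID)
tokenizer = get_tokenizer(HUB_ID)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # placeholder hyperparameter

def contrastive_step(images, descriptions):
    """One symmetric InfoNCE step on a batch of matched (scan, description) pairs.

    images: preprocessed tensor of shape (B, 3, 224, 224)
    descriptions: list of B expert abnormality descriptions
    """
    texts = tokenizer(descriptions, context_length=256)
    img_emb, txt_emb, logit_scale = model(images, texts)  # embeddings are L2-normalized
    logits = logit_scale * img_emb @ txt_emb.t()          # (B, B) cosine-similarity logits
    targets = torch.arange(len(descriptions))             # matched pairs lie on the diagonal
    loss = (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2   # image-to-text + text-to-image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```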

📝 Abstract
Objective: Assessing Alzheimer's disease (AD) using high-dimensional radiology images is clinically important but challenging. Although artificial intelligence (AI) has advanced AD diagnosis, it remains unclear how to design AI models that embrace both predictability and explainability. Here, we propose VisTA, a multimodal language-vision model assisted by contrastive learning, to optimize disease prediction and evidence-based, interpretable explanations for clinical decision-making.

Methods: We developed VisTA (Vision-Text Alignment Model) for AD diagnosis. Architecturally, we built VisTA from BiomedCLIP and fine-tuned it using contrastive learning to align images with verified abnormalities and their descriptions. To train VisTA, we used a constructed reference dataset containing images, abnormality types, and descriptions verified by medical experts. VisTA produces four outputs: predicted abnormality type, similarity to reference cases, evidence-driven explanation, and a final AD diagnosis. To illustrate VisTA's efficacy, we report accuracy metrics for abnormality retrieval and dementia prediction. To demonstrate VisTA's explainability, we compared its explanations with those of human experts.

Results: Compared to the 15 million images used for baseline pretraining, VisTA used only 170 samples for fine-tuning and obtained significant improvements in abnormality retrieval and dementia prediction. For abnormality retrieval, VisTA reached 74% accuracy and an AUC of 0.87 (vs. 26% and 0.74 for baseline models). For dementia prediction, VisTA achieved 88% accuracy and an AUC of 0.82 (vs. 30% and 0.57 for baseline models). The generated explanations agreed strongly with those of human experts and provided insight into the diagnostic process. Taken together, VisTA optimizes prediction, clinical reasoning, and explanation.
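
Three of the four outputs the abstract lists (abnormality type, similarity to reference cases, and evidence text) amount to nearest-neighbor retrieval in the shared embedding space. The sketch below illustrates that step under the same `open_clip` assumptions as above; the `build_reference_bank` and `retrieve_evidence` helpers and the dict-based case format are hypothetical, not from the paper.

```python
import torch

@torch.no_grad()
def build_reference_bank(model, tokenizer, cases):
    """Pre-embed the expert-verified reference descriptions once.

    cases: list of dicts with 'abnormality' and 'description' keys.
    """
    texts = tokenizer([c["description"] for c in cases], context_length=256)
    emb = model.encode_text(texts)
    return emb / emb.norm(dim=-1, keepdim=True)           # unit-norm for cosine similarity

@torch.no_grad()
def retrieve_evidence(model, preprocess, query_image, cases, bank_emb):
    """Return the closest reference case as evidence for the prediction."""
    img = preprocess(query_image).unsqueeze(0)            # (1, 3, 224, 224)
    img_emb = model.encode_image(img)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ bank_emb.t()).squeeze(0)            # cosine similarity to each case
    best = int(sims.argmax())
    return {
        "abnormality": cases[best]["abnormality"],        # predicted abnormality type
        "similarity": float(sims[best]),                  # similarity to the reference case
        "evidence": cases[best]["description"],           # expert text surfaced as explanation
    }
```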
Problem

Research questions and friction points this paper is trying to address.

Alzheimer's Disease Prediction
Explainable AI
Medical Diagnosis Assistance
Innovation

Methods, ideas, or system contributions that make the work stand out.

VisTA Model
Contrastive Learning
Alzheimer's Diagnosis