Towards Responsible Multimodal Medical Reasoning via Context-Aligned Vision-Language Models

📅 2026-04-09

🤖 AI Summary
This work addresses the susceptibility of existing medical vision-language models to hallucinated diagnoses in radiology caused by overreliance on a single modality. To mitigate this, the authors propose a context-aligned reasoning framework that enforces consensus among multiple clinical evidence sources (radiomics statistics, explainable activation maps, and lexical semantic cues) before generating a diagnostic statement. The framework guides a frozen vision-language model to produce structured outputs that explicitly include supporting evidence, uncertainty estimates, limitations, and safety disclaimers. Rather than merely fusing modalities, the approach uses a context-validation mechanism: auxiliary signals alone provide limited benefit, and gains emerge only when they are integrated through contextual verification. Experiments on chest X-ray data report a higher AUC (0.918 to 0.925), a 78% reduction in hallucination-related keywords (1.14 to 0.25 per report), more concise outputs (19.4 to 15.3 words), and calibrated uncertainty with no increase in model confidence (0.70 to 0.68).
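The reported figures can be sanity-checked with a short calculation. The sketch below only recomputes relative changes from the before/after values quoted in the summary. The helper name `relative_change` and the dictionary keys are illustrative and do not come from the paper.

```python
def relative_change(before: float, after: float) -> float:
    """Return the signed relative change from `before` to `after` as a fraction."""
    return (after - before) / before


# Before/after values as quoted in the summary (keys are illustrative labels).
reported = {
    "auc": (0.918, 0.925),
    "hallucinated_keywords": (1.14, 0.25),
    "explanation_words": (19.4, 15.3),
    "confidence": (0.70, 0.68),
}

changes = {name: relative_change(b, a) for name, (b, a) in reported.items()}

for name, change in changes.items():
    print(f"{name}: {change:+.1%}")
```

The keyword figure works out to (1.14 - 0.25) / 1.14 ≈ 0.78, consistent with the stated 78% reduction. By the same arithmetic, the drop from 19.4 to 15.3 words is roughly a 21% reduction in explanation length.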

📝 Abstract
Medical vision-language models (VLMs) show strong performance on radiology tasks but often produce fluent yet weakly grounded conclusions due to over-reliance on a dominant modality. We introduce a context-aligned reasoning framework that enforces agreement across heterogeneous clinical evidence before generating diagnostic conclusions. The proposed approach augments a frozen VLM with structured contextual signals derived from radiomic statistics, explainability activations, and vocabulary-grounded semantic cues. Instead of producing free-form responses, the model generates structured outputs containing supporting evidence, uncertainty estimates, limitations, and safety notes. We observe that auxiliary signals alone provide limited benefit; performance gains emerge only when these signals are integrated through contextual verification. Experiments on chest X-ray datasets demonstrate that context alignment improves discriminative performance (AUC 0.918 to 0.925) while maintaining calibrated uncertainty. The framework also substantially reduces hallucinated keywords (1.14 to 0.25) and produces more concise reasoning explanations (19.4 to 15.3 words) without increasing model confidence (0.70 to 0.68). Cross-dataset evaluation on CheXpert further reveals that modality informativeness significantly influences reasoning behavior. These results suggest that enforcing multi-evidence agreement improves both reliability and trustworthiness in medical multimodal reasoning, while preserving the underlying model architecture.
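To make the idea of "agreement before conclusion" concrete, here is a minimal sketch of a gate that emits a diagnostic conclusion only when enough auxiliary evidence sources support the model's candidate finding, and otherwise returns an explicitly inconclusive structured report. This is not the authors' implementation: the class names, the field schema, the `min_agreeing` and `min_score` thresholds, and the rule `uncertainty = 1 - confidence` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EvidenceSignal:
    """One auxiliary evidence source (hypothetical schema)."""
    source: str    # e.g. "radiomics", "activation_map", "vocabulary"
    finding: str   # finding label this source supports
    score: float   # support strength in [0, 1]


@dataclass
class StructuredReport:
    """Structured output with the four parts named in the abstract."""
    conclusion: str
    supporting_evidence: List[str]
    uncertainty: float
    limitations: List[str] = field(default_factory=list)
    safety_note: str = "Not a substitute for review by a qualified clinician."


def context_aligned_report(
    vlm_finding: str,
    vlm_confidence: float,
    signals: List[EvidenceSignal],
    min_agreeing: int = 2,   # assumed threshold, not from the paper
    min_score: float = 0.5,  # assumed threshold, not from the paper
) -> StructuredReport:
    """Emit the candidate finding only if enough evidence sources agree with it."""
    agreeing = [
        s for s in signals
        if s.finding == vlm_finding and s.score >= min_score
    ]
    evidence = [f"{s.source} (score {s.score:.2f})" for s in agreeing]
    uncertainty = 1.0 - vlm_confidence  # assumed mapping from confidence

    if len(agreeing) >= min_agreeing:
        return StructuredReport(vlm_finding, evidence, uncertainty)

    # Too little agreement: withhold the conclusion and say why.
    note = (
        f"Only {len(agreeing)} of {len(signals)} evidence sources "
        f"support '{vlm_finding}'."
    )
    return StructuredReport("inconclusive", evidence, uncertainty, [note])


if __name__ == "__main__":
    demo_signals = [
        EvidenceSignal("radiomics", "cardiomegaly", 0.80),
        EvidenceSignal("activation_map", "cardiomegaly", 0.70),
        EvidenceSignal("vocabulary", "effusion", 0.90),
    ]
    print(context_aligned_report("cardiomegaly", 0.68, demo_signals))
    print(context_aligned_report("effusion", 0.70, demo_signals))
```

In this toy version the gate sits entirely outside the model, which mirrors the abstract's point that the underlying VLM stays frozen. How the paper actually computes agreement and feeds it to the model is not specified in this listing.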
Problem

Research questions and friction points this paper is trying to address.

medical vision-language models
multimodal reasoning
hallucination
modality imbalance
diagnostic reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

context-aligned reasoning
vision-language models
multimodal medical reasoning
structured output generation
evidence agreement
Sumra Khan
Department of Computer Science, Salim Habib University, Karachi, Pakistan
Sagar Chhabriya
Computer Science, Institute of Business Administration Sukkur, Pakistan
Aizan Zafar
Center for Research in Computer Vision, University of Central Florida, USA
Sheeraz Arif
Department of Computer Science, Salim Habib University, Karachi, Pakistan
Amgad Muneer
The University of Texas MD Anderson Cancer Center, USA
Anas Zafar
The University of Texas MD Anderson Cancer Center, USA
Shaina Raza
Toronto Metropolitan University, Vector Institute, Canada
Rizwan Qureshi
Center for Research in Computer Vision (CRCV), University of Central Florida, Orlando, USA
Cancer Data Science, Responsible AI, Computer Vision, Bioinformatics, Machine Learning