Medical Context Distorts Decisions in Clinical Vision Language Models

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the reliability limitations of clinical vision-language models (VLMs) in diagnostic settings, where predictions are often unduly influenced by irrelevant textual information or variations in prompt phrasing. Leveraging the MIMIC-CXR dataset, the authors systematically manipulate image-text alignment, clinical history content, and prompt formulation to identify and validate three critical failure modes: excessive reliance on textual inputs even when visual evidence is sufficient, susceptibility to misleading irrelevant patient histories, and prediction instability under semantically equivalent prompts. Evaluations across both open-source and proprietary VLMs demonstrate a consistent and significant bias toward text over image data, with minor prompt perturbations frequently reversing otherwise correct diagnoses. These findings underscore substantial robustness risks that challenge the clinical deployability of current VLMs.

📝 Abstract

Vision-language models (VLMs) are increasingly proposed for clinical decision support, yet their reliability in real-world scenarios that require integrating visual and textual context from medical records remains poorly characterized. This paper identifies three failure modes: (1) over-reliance on text at the expense of images, (2) spurious reliance on irrelevant clinical history, and (3) sensitivity to semantically equivalent prompt variations. We evaluate a diverse set of general-domain and medically tuned VLMs, both open and closed, on chest X-ray tasks using MIMIC-CXR. By systematically manipulating image-text alignment, clinical history, and prompt formulation, we find that VLM decisions are dominated by the text modality, even when visual evidence is available. Moreover, VLMs are heavily influenced by irrelevant reports, and minor prompt changes can reverse correct image-based predictions. Our findings underscore the need for explicit safeguards and stress-testing before these models are considered for clinical practice.
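
One way to quantify how often a perturbation reverses a correct image-based prediction is a flip rate: among cases the model answers correctly from the image alone, the fraction whose answer becomes wrong once an irrelevant history or a reworded prompt is added. The Python sketch below is illustrative only. The function name `flip_rate`, the label strings, and the toy predictions are all assumptions made for this example; they are not the paper's metric, code, or results.

```python
from typing import Sequence


def flip_rate(
    labels: Sequence[str],
    baseline: Sequence[str],
    perturbed: Sequence[str],
) -> float:
    """Fraction of initially correct predictions that become wrong after a perturbation."""
    if not (len(labels) == len(baseline) == len(perturbed)):
        raise ValueError("all sequences must have the same length")
    # Keep only the cases the model got right under the baseline (image-only) condition.
    correct = [i for i, (y, b) in enumerate(zip(labels, baseline)) if y == b]
    if not correct:
        return 0.0
    # Count how many of those cases are wrong under the perturbed condition.
    flipped = sum(1 for i in correct if perturbed[i] != labels[i])
    return flipped / len(correct)


# Made-up toy data for illustration; not taken from MIMIC-CXR or the paper.
labels = ["pneumonia", "normal", "effusion", "normal"]
image_only = ["pneumonia", "normal", "effusion", "pneumonia"]
with_irrelevant_history = ["normal", "normal", "pneumonia", "pneumonia"]

rate = flip_rate(labels, image_only, with_irrelevant_history)
print(f"flip rate: {rate:.2f}")
```

Conditioning on baseline-correct cases separates the effect of the perturbation from errors the model would have made anyway. The same function could be applied to any pair of conditions, for example an original prompt versus a paraphrase of it.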

Problem

Research questions and friction points this paper is trying to address.

clinical vision-language models

modality over-reliance

spurious correlation

prompt sensitivity

medical decision support

Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language models

clinical decision support

modality bias