🤖 AI Summary
Clinical decision-making requires joint analysis of medical images and textual reports; however, prevailing vision-language models (VLMs) suffer from text modality bias, neglecting critical visual cues and increasing misdiagnosis risk. This work systematically identifies, for the first time, pronounced text-dominant bias in mainstream open-source VLMs across chest X-ray and fundus image tasks. We propose Selective Modality Shifting (SMS), a framework integrating perturbation analysis, cross-sample modality swapping, attention visualization, and calibration-based evaluation to quantitatively characterize model reliance on each modality. Experiments reveal that all evaluated VLMs depend excessively on textual inputs, even when the images provide unambiguous diagnostic evidence, leading to erroneous predictions under misleading reports. Our study establishes the first empirical benchmark for text bias in multimodal clinical AI and introduces an interpretable, quantitative framework for diagnosing modality imbalance, advancing the development of robust, balanced clinical VLMs.
📝 Abstract
Clinical decision-making relies on the integrated analysis of medical images and the associated clinical reports. While Vision-Language Models (VLMs) can offer a unified framework for such tasks, they can exhibit strong biases toward one modality, frequently overlooking critical visual cues in favor of textual information. In this work, we introduce Selective Modality Shifting (SMS), a perturbation-based approach to quantify a model's reliance on each modality in binary classification tasks. By systematically swapping images or text between samples with opposing labels, we expose modality-specific biases. We assess six open-source VLMs (four generalist models and two fine-tuned on medical data) on two medical imaging datasets with distinct modalities: MIMIC-CXR (chest X-ray) and FairVLMed (scanning laser ophthalmoscopy). By measuring the performance and calibration of every model in both unperturbed and perturbed settings, we reveal a marked dependency on text input, which persists despite the presence of complementary visual information. A qualitative attention-based analysis further confirms that image content is often overshadowed by text details. Our findings highlight the importance of designing and evaluating multimodal medical models that genuinely integrate visual and textual cues, rather than relying on single-modality signals.
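To make the cross-sample swapping idea concrete, here is a minimal sketch of the SMS-style perturbation: each sample with a positive label is paired with one of the opposite label, a single modality is shifted across the pair, and we then count how often the model's prediction agrees with the label implied by the image versus the one implied by the text. All names here (`Sample`, `sms_swap`, `modality_reliance`, and the toy model) are illustrative assumptions, not the paper's actual code or data.

```python
# Hedged sketch of Selective Modality Shifting (SMS) for a binary task.
# A "sample" carries a stand-in image, a stand-in report, and a label.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Sample:
    image: str   # placeholder for pixel data
    text: str    # placeholder for the clinical report
    label: int   # binary ground-truth label

def sms_swap(pos: List[Sample], neg: List[Sample],
             shift: str) -> List[Tuple[Sample, int, int]]:
    """Pair opposite-label samples and shift one modality across the pair.

    Returns (perturbed sample, image-implied label, text-implied label),
    so we can tell which modality a later prediction agrees with.
    """
    perturbed = []
    for a, b in zip(pos, neg):
        if shift == "text":       # keep a's image, take b's report
            perturbed.append((Sample(a.image, b.text, a.label), a.label, b.label))
        elif shift == "image":    # keep a's report, take b's image
            perturbed.append((Sample(b.image, a.text, a.label), b.label, a.label))
        else:
            raise ValueError("shift must be 'text' or 'image'")
    return perturbed

def modality_reliance(model: Callable[[Sample], int],
                      perturbed: List[Tuple[Sample, int, int]]
                      ) -> Tuple[float, float]:
    """Fraction of predictions that follow the image label vs. the text label."""
    img_hits = sum(model(s) == img_lbl for s, img_lbl, _ in perturbed)
    txt_hits = sum(model(s) == txt_lbl for s, _, txt_lbl in perturbed)
    n = len(perturbed)
    return img_hits / n, txt_hits / n

# Toy demonstration: a model that only reads the report is fully exposed
# by a text shift -- its predictions track the swapped-in text label.
pos = [Sample(f"img_pos{i}", "report: finding present", 1) for i in range(4)]
neg = [Sample(f"img_neg{i}", "report: no finding", 0) for i in range(4)]
text_only_model = lambda s: 1 if "present" in s.text else 0

img_rel, txt_rel = modality_reliance(text_only_model,
                                     sms_swap(pos, neg, shift="text"))
```

Under this construction, a perfectly text-biased model yields an image-reliance of 0 and a text-reliance of 1, while a model that genuinely reads the image would keep its predictions aligned with the image-implied label after the swap.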