Why Text Prevails: Vision May Undermine Multimodal Medical Decision Making

📅 2025-12-14
🤖 AI Summary
This study identifies a critical failure mode of multimodal large language models (MLLMs) in medical decision-making: on Alzheimer’s disease staging (normal/MCI/dementia) and MIMIC-CXR’s 14-class multi-label chest X-ray diagnosis, pure text-based reasoning outperforms vision-only or vision–language fusion by 5–12%, revealing pervasive “visual interference.” The authors diagnose insufficient visual grounding as the root cause. To mitigate it, they propose three strategies: (1) chain-of-thought prompting with reasoning-annotated exemplars; (2) converting images into textual descriptions for subsequent language-only inference; and (3) few-shot supervised fine-tuning of the visual encoder. Experiments show that the image-description-to-text-reasoning pipeline narrows the performance gap, bringing multimodal accuracy close to the text-only upper bound. These findings offer a conceptual framework for medical multimodal modeling and reproducible technical pathways for strengthening visual grounding in clinical MLLMs.

📝 Abstract
With the rapid progress of large language models (LLMs), advanced multimodal large language models (MLLMs) have demonstrated impressive zero-shot capabilities on vision-language tasks. In the biomedical domain, however, even state-of-the-art MLLMs struggle with basic Medical Decision Making (MDM) tasks. We investigate this limitation using two challenging datasets: (1) three-stage Alzheimer's disease (AD) classification (normal, mild cognitive impairment, dementia), where category differences are visually subtle, and (2) MIMIC-CXR chest radiograph classification with 14 non-mutually exclusive conditions. Our empirical study shows that text-only reasoning consistently outperforms vision-only or vision-text settings, with multimodal inputs often performing worse than text alone. To mitigate this, we explore three strategies: (1) in-context learning with reason-annotated exemplars, (2) vision captioning followed by text-only inference, and (3) few-shot fine-tuning of the vision tower with classification supervision. These findings reveal that current MLLMs lack grounded visual understanding and point to promising directions for improving multimodal decision making in healthcare.
Problem

Research questions and friction points this paper is trying to address.

MLLMs struggle with visually subtle medical classification tasks
Multimodal inputs underperform text-only reasoning in medical decision making
Current MLLMs lack grounded visual understanding for healthcare applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision captioning for text-only inference
Few-shot fine-tuning of vision tower
In-context learning with reason-annotated exemplars
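The second strategy above — caption the image first, then reason over text only — can be sketched as a two-stage pipeline. This is a minimal illustration, not the authors' implementation: `caption_image` and `classify_text` are hypothetical stand-ins (a trivial keyword rule substitutes for the captioning model and the text-only LLM), and the label set follows the paper's three-stage AD task.

```python
# Sketch of the caption-then-reason pipeline: vision -> text description,
# then language-only inference over that description. Both functions below
# are placeholders for real model calls.

LABELS = ["normal", "mild cognitive impairment", "dementia"]

def caption_image(image_path: str) -> str:
    """Stand-in for an MLLM captioning call (e.g. 'describe this scan')."""
    return f"MRI at {image_path}: mild hippocampal atrophy, enlarged ventricles."

def classify_text(description: str, labels: list[str]) -> str:
    """Stand-in for text-only LLM reasoning over the caption.
    A trivial keyword rule substitutes for the model here."""
    if "atrophy" in description and "mild" in description:
        return labels[1]
    if "atrophy" in description:
        return labels[2]
    return labels[0]

def caption_then_reason(image_path: str) -> str:
    # Stage 1: convert the image into text; stage 2: text-only inference.
    description = caption_image(image_path)
    return classify_text(description, LABELS)

print(caption_then_reason("subject_042.nii"))
```

The design point is that the final decision is made entirely in the text domain, which is where the paper reports current models reason most reliably.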
Siyuan Dai
Department of Electrical & Computer Engineering, University of Pittsburgh, Pittsburgh, PA, USA
Lunxiao Li
Department of Computer Science, NC State University, Raleigh, NC, USA
Kun Zhao
Department of Electrical & Computer Engineering, University of Pittsburgh, Pittsburgh, PA, USA
Eardi Lila
Department of Biostatistics, University of Washington, Seattle, WA, USA
Paul K. Crane
Department of Medicine, University of Washington, Seattle, WA, USA
Heng Huang
Brendan Iribe Endowed Professor in Computer Science, University of Maryland, College Park
Machine Learning, AI, Biomedical Data Science, Computer Vision
Dongkuan Xu
Department of Computer Science, University of Texas Rio Grande Valley, Edinburg, TX, USA
Haoteng Tang
Assistant Professor in Computer Science, University of Texas Rio Grande Valley.
machine learning, data mining, medical image computing and bioinformatics
Liang Zhan
Associate Professor, Depts. of ECE and BioE, University of Pittsburgh
Medical Signal Modeling, Neuroimaging, Computational Neuroscience, Machine Learning, Bioinformatics