🤖 AI Summary
This study identifies a critical failure mode of multimodal large language models (MLLMs) in medical decision-making: on Alzheimer's disease staging (NCI/MCI/dementia) and MIMIC-CXR's 14-class multi-label chest X-ray diagnosis, pure text-based reasoning outperforms vision-only or vision–language fusion by 5–12%, revealing pervasive "visual interference." The authors systematically diagnose insufficient visual grounding as the root cause. To mitigate it, they propose three strategies: (1) chain-of-thought prompting with reasoning-annotated examples; (2) generating textual image descriptions and then performing language-only inference over them; and (3) few-shot supervised fine-tuning of the visual encoder. Experiments show that the caption-to-text-reasoning pipeline narrows the performance gap, bringing multimodal accuracy close to the text-only upper bound. These findings offer a conceptual framework for medical multimodal modeling and reproducible, generalizable pathways to strengthen visual grounding in clinical MLLMs.
📝 Abstract
With the rapid progress of large language models (LLMs), advanced multimodal large language models (MLLMs) have demonstrated impressive zero-shot capabilities on vision-language tasks. In the biomedical domain, however, even state-of-the-art MLLMs struggle with basic Medical Decision Making (MDM) tasks. We investigate this limitation using two challenging datasets: (1) three-stage Alzheimer's disease (AD) classification (normal, mild cognitive impairment, dementia), where category differences are visually subtle, and (2) MIMIC-CXR chest radiograph classification with 14 non-mutually exclusive conditions. Our empirical study shows that text-only reasoning consistently outperforms vision-only or vision-text settings, with multimodal inputs often degrading performance relative to text alone. To mitigate this, we explore three strategies: (1) in-context learning with reasoning-annotated exemplars, (2) vision captioning followed by text-only inference, and (3) few-shot fine-tuning of the vision tower with classification supervision. These findings reveal that current MLLMs lack grounded visual understanding and point to promising directions for improving multimodal decision making in healthcare.
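The second strategy, captioning followed by text-only inference, can be sketched as a two-stage pipeline: the multimodal model is used only to turn the image into a textual description, and all diagnostic reasoning then happens over text. The sketch below is a minimal, self-contained illustration under stated assumptions, not the paper's implementation; `caption_image` and `classify_from_text` are hypothetical stand-ins for an MLLM captioning call and a text-only LLM classifier (here replaced by trivial keyword matching so the example runs on its own).

```python
# Sketch of strategy (2): vision captioning, then text-only reasoning.
# All model calls are hypothetical stand-ins, not the authors' code.

# A subset of the 14 MIMIC-CXR condition labels, for illustration.
CXR_LABELS = ["Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "No Finding"]


def caption_image(image_path: str) -> str:
    """Hypothetical stand-in for an MLLM image-captioning call.

    In practice this would invoke a multimodal model on the radiograph
    and return its free-text description.
    """
    return "Enlarged cardiac silhouette; no focal consolidation."


def classify_from_text(caption: str, labels: list[str]) -> list[str]:
    """Hypothetical stand-in for a text-only LLM classifier.

    Trivial keyword matching (with naive negation handling) keeps the
    sketch runnable; a real system would prompt a text-only LLM with the
    caption and the candidate labels.
    """
    keyword_map = {"Cardiomegaly": "enlarged cardiac", "Consolidation": "consolidation"}
    text = caption.lower()
    found = []
    for label, kw in keyword_map.items():
        # Skip findings the caption explicitly negates, e.g. "no focal consolidation".
        negated = f"no {kw}" in text or f"no focal {kw.split()[-1]}" in text
        if kw in text and not negated:
            found.append(label)
    return [l for l in found if l in labels] or ["No Finding"]


def caption_then_reason(image_path: str, labels: list[str]) -> list[str]:
    # Stage 1: image -> text (the only place the multimodal model is used).
    caption = caption_image(image_path)
    # Stage 2: text-only inference over the caption; no image input here.
    return classify_from_text(caption, labels)
```

The design point is the hard boundary between the stages: once the caption is produced, the classifier never sees pixels, so any remaining errors are attributable to the caption rather than to visual interference during reasoning.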