Your other Left! Vision-Language Models Fail to Identify Relative Positions in Medical Images

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical vision-language models (VLMs) exhibit severe deficiencies in clinically critical relative spatial reasoning, i.e., identifying the positional relationships between anatomical structures and abnormalities. To address this gap, we introduce MIRP, a benchmark explicitly designed for evaluating relative localization in medical imaging. MIRP systematically assesses leading VLMs (GPT-4o, Llama3.2, Pixtral, and JanusPro) on fine-grained spatial relation understanding. Experimental results reveal that these models rely heavily on textual priors rather than visual evidence, yielding substantially lower localization accuracy on medical images than on natural images. Moreover, augmenting inputs with alphanumeric or color-coded visual cues yields only marginal improvements, underscoring fundamental limitations in spatial reasoning. MIRP fills a critical void in the evaluation of fine-grained spatial comprehension for medical VLMs and establishes a standardized, empirically grounded testbed to advance research on spatially aware medical AI.

📝 Abstract
Clinical decision-making relies heavily on understanding the relative positions of anatomical structures and anomalies. Therefore, for Vision-Language Models (VLMs) to be applicable in clinical practice, the ability to accurately determine relative positions in medical images is a fundamental prerequisite. Despite its importance, this capability remains largely underexplored. To address this gap, we evaluate state-of-the-art VLMs (GPT-4o, Llama3.2, Pixtral, and JanusPro) on this task and find that all models fail at it. Inspired by successful approaches in computer vision, we investigate whether visual prompts, such as alphanumeric or colored markers placed on anatomical structures, can enhance performance. While these markers provide moderate improvements, results on medical images remain significantly below those observed on natural images. Our evaluations suggest that, in medical imaging, VLMs rely more on prior anatomical knowledge than on actual image content when answering relative-position questions, often leading to incorrect conclusions. To facilitate further research in this area, we introduce the MIRP (Medical Imaging Relative Positioning) benchmark dataset, designed to systematically evaluate the ability to identify relative positions in medical images.
Problem

Research questions and friction points this paper is trying to address.

VLMs fail to identify relative positions in medical images
Visual prompts improve performance but remain insufficient
VLMs rely on prior knowledge over image content
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluate VLMs on medical image positioning
Use visual prompts to enhance performance
Introduce MIRP benchmark dataset
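The evaluation idea above, asking a VLM whether one structure lies left or right of another and checking the answer against coordinate-derived ground truth, can be sketched as follows. This is a minimal illustration, not the MIRP question format or the authors' code; the function names, question template, and coordinates are assumptions.

```python
# Hedged sketch of relative-position QA: build a question about two labeled
# structures and score a model's free-text answer against ground truth
# derived from image coordinates. Template and names are illustrative only.
def make_question(struct_a, struct_b, axis="horizontal"):
    """Compose a relative-position question for a VLM."""
    rel = "left of or right of" if axis == "horizontal" else "above or below"
    return (f"In this image, is the {struct_a} {rel} the {struct_b}? "
            f"Answer based only on the image content.")

def check_answer(answer, coord_a, coord_b, axis="horizontal"):
    """Derive ground truth from coordinates and test the answer.

    Uses image coordinates (smaller x = further left, smaller y = higher);
    note that radiological viewing convention can flip left and right,
    which is exactly the kind of ambiguity such a benchmark must fix.
    """
    if axis == "horizontal":
        truth = "left" if coord_a < coord_b else "right"
    else:
        truth = "above" if coord_a < coord_b else "below"
    return truth in answer.lower()

# Example: a hypothetical lesion at x=40 and heart at x=90.
question = make_question("lesion", "heart")
correct = check_answer("The lesion is left of the heart.", 40, 90)
```

A full harness would also randomize label order and axes, since a model relying on anatomical priors rather than the image would fail exactly such permuted cases.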
Daniel Wolf
Visual Computing Group, Institute of Media Informatics, Ulm University, Germany
Heiko Hillenhagen
Diagnostic and Interventional Radiology, Ulm University Medical Center, Germany
Billurvan Taskin
Diagnostic and Interventional Radiology, Ulm University Medical Center, Germany
Alex Bäuerle
Axiom Bio, USA
Meinrad Beer
Diagnostic and Interventional Radiology, Ulm University Medical Center, Germany
Michael Götz
Junior Professor, Section Experimental Radiology, University Hospital Ulm
Machine Learning · Personalized Medicine · Radiomics · Transfer Learning · Medical Image Analysis
Timo Ropinski
Ulm University
Visual Computing · 3D Deep Learning · 3D Computer Vision · Data Visualization · Computer Graphics