🤖 AI Summary
Medical vision-language models (VLMs) exhibit severe deficiencies in clinically critical relative spatial reasoning—specifically, identifying the spatial relationships between anatomical structures and abnormalities. To address this gap, we introduce MIRP, the first benchmark explicitly designed for evaluating relative localization in medical imaging. MIRP systematically assesses leading VLMs—including GPT-4o, Llama3.2, Pixtral, and JanusPro—on fine-grained spatial relation understanding. Experimental results reveal that these models rely heavily on textual priors rather than visual evidence, yielding substantially lower localization accuracy on medical images than on natural images. Moreover, augmenting inputs with alphanumeric or color-coded visual cues yields only marginal improvements, underscoring fundamental limitations in spatial reasoning. MIRP fills a critical void in the evaluation of fine-grained spatial comprehension for medical VLMs and establishes a standardized, empirically grounded testbed to advance research on spatially aware medical AI.
📝 Abstract
Clinical decision-making relies heavily on understanding the relative positions of anatomical structures and anomalies. Therefore, for Vision-Language Models (VLMs) to be applicable in clinical practice, the ability to accurately determine relative positions in medical images is a fundamental prerequisite. Despite its importance, this capability remains largely underexplored. To address this gap, we evaluate state-of-the-art VLMs—GPT-4o, Llama3.2, Pixtral, and JanusPro—on this task and find that all of them fail at it. Inspired by successful approaches in computer vision, we investigate whether visual prompts, such as alphanumeric or colored markers placed on anatomical structures, can improve performance. While these markers provide moderate improvements, results on medical images remain significantly lower than those reported on natural images. Our evaluations suggest that, in medical imaging, VLMs rely more on prior anatomical knowledge than on actual image content when answering relative position questions, often leading to incorrect conclusions. To facilitate further research in this area, we introduce the MIRP (Medical Imaging Relative Positioning) benchmark dataset, designed to systematically evaluate the capability to identify relative positions in medical images.
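The visual-prompting idea mentioned above—overlaying alphanumeric, color-coded markers on structures before querying the model—can be sketched as follows. This is a minimal illustration using Pillow; the marker positions, sizes, and colors here are hypothetical placeholders, not the paper's actual pipeline.

```python
from PIL import Image, ImageDraw


def add_markers(image, points, radius=10, colors=("red", "lime", "blue", "yellow")):
    """Overlay alphanumeric markers (A, B, C, ...) at the given (x, y) points.

    Each point gets a colored circle plus a letter label, mimicking the
    kind of visual prompt described in the abstract.
    """
    img = image.convert("RGB")  # ensure we can draw in color on a grayscale scan
    draw = ImageDraw.Draw(img)
    for i, (x, y) in enumerate(points):
        label = chr(ord("A") + i)
        color = colors[i % len(colors)]
        draw.ellipse((x - radius, y - radius, x + radius, y + radius),
                     outline=color, width=3)
        draw.text((x - 4, y - 6), label, fill=color)
    return img


# Stand-in grayscale "scan" with two hypothetical structures marked.
scan = Image.new("L", (256, 256), color=40)
marked = add_markers(scan, [(80, 100), (180, 140)])
marked.save("marked_scan.png")  # image then sent to the VLM alongside the question
```

The marked image would then be passed to the VLM together with a relative-position question (e.g., "Is structure A left of structure B?"), letting the model ground its answer in explicit visual anchors rather than anatomical priors.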