Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models

📅 2026-05-09

🤖 AI Summary

This study addresses the lack of systematic evaluation of whether existing 3D medical vision-language models genuinely comprehend the spatial semantics of anatomical structures. To this end, we introduce CT-SpatialVQA, a benchmark comprising 9,077 high-quality spatial reasoning questions derived from 1,601 clinical CT scans and their associated radiology reports. The benchmark covers anatomical localization, left–right discrimination, structural comparison, and 3D relational reasoning. It is built with LLM-assisted annotation and rigorous image–report alignment, and it is paired with a standardized evaluation protocol. Under this protocol, eight state-of-the-art models achieve an average accuracy of only 34%, with performance on multiple tasks falling below random chance. These findings underscore a critical deficiency in current models’ ability to interpret volumetric data for spatial semantic understanding.

📝 Abstract

Recent advances in 3D medical vision-language models (VLMs) have enabled joint reasoning over volumetric images and text, showing strong performance in medical visual question answering (VQA) and report generation. Despite this progress, it remains unclear whether these models learn spatially grounded anatomy from 3D volumes or rely primarily on learned priors and language correlations. This uncertainty stems from the lack of systematic evaluation of semantic-spatial reasoning in volumetric medical VLMs, which clinically reliable decision support requires. To address this gap, we introduce CT-SpatialVQA, a benchmark designed to evaluate semantic-spatial reasoning in 3D CT data. The benchmark comprises 9,077 clinically grounded question-answer (QA) pairs derived directly from 1,601 radiology reports and CT volumes. The QA pairs are validated via a robust LLM-assisted pipeline with a 95% human consensus agreement rate. Our dataset requires explicit anatomical localization, laterality awareness, structural comparison, and 3D inter-structure relational reasoning. We also introduce a standardized evaluation protocol and benchmark eight 3D medical VLMs. These models degrade severely on semantic-spatial reasoning tasks, averaging 34% accuracy and often scoring below random chance, which highlights the need for deeper integration of volumetric evidence for trustworthy clinical use.
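
The comparison between model accuracy and random chance can be illustrated with a minimal sketch. It is not the paper's evaluation protocol: it assumes closed-ended questions with a known number of answer options, exact string matching for scoring, and a hypothetical record layout (`task`, `prediction`, `answer`, `num_options`). None of these details are specified in the abstract.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Compute accuracy and the random-guess baseline for each task.

    Each record is a dict with hypothetical keys:
      task         task category name, e.g. "laterality"
      prediction   the model's answer string
      answer       the reference answer string
      num_options  number of answer choices (assumed closed-ended)
    """
    stats = defaultdict(lambda: {"correct": 0, "total": 0, "chance_sum": 0.0})
    for r in records:
        s = stats[r["task"]]
        s["total"] += 1
        # Assumption: case-insensitive exact match counts as correct.
        match = r["prediction"].strip().lower() == r["answer"].strip().lower()
        s["correct"] += int(match)
        # Uniform guessing over k options is correct with probability 1/k.
        s["chance_sum"] += 1.0 / r["num_options"]

    report = {}
    for task, s in stats.items():
        accuracy = s["correct"] / s["total"]
        chance = s["chance_sum"] / s["total"]
        report[task] = {
            "accuracy": accuracy,
            "chance": chance,
            "below_chance": accuracy < chance,
        }
    return report


# Toy records for illustration only; these are not benchmark data.
records = [
    {"task": "laterality", "prediction": "left", "answer": "right", "num_options": 2},
    {"task": "laterality", "prediction": "left", "answer": "right", "num_options": 2},
    {"task": "laterality", "prediction": "Left", "answer": "left", "num_options": 2},
    {"task": "laterality", "prediction": "right", "answer": "left", "num_options": 2},
    {"task": "localization", "prediction": "liver", "answer": "liver", "num_options": 4},
    {"task": "localization", "prediction": "spleen", "answer": "kidney", "num_options": 4},
]
report = per_task_accuracy(records)
```

Reporting the chance level next to each task's accuracy matters because the baseline differs by task: a binary left–right question has a 50% guessing rate, so a 34% score there would fall below chance, while the same score on a four-option question would sit above it.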

Problem

Research questions and friction points this paper is trying to address.

semantic-spatial reasoning

3D medical vision-language models

volumetric understanding

clinical reliability

anatomical localization

Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic-spatial reasoning

3D medical vision-language models

CT-SpatialVQA benchmark