🤖 AI Summary
Current medical multimodal large language models (MLLMs) largely lack spatial reasoning capabilities for three-dimensional (3D) medical imaging, primarily because high-quality, structured 3D spatial annotations are scarce. To close this gap, this work constructs SpatialMed, the first comprehensive benchmark for evaluating 3D spatial intelligence in medical MLLMs. Using a multi-agent collaborative framework that orchestrates volumetric and distance computation tools, together with validation by expert radiologists, the authors automatically generate nearly 10,000 high-quality 3D spatial visual question-answering (VQA) pairs spanning diverse organs and tumor types. Evaluations of 14 state-of-the-art models reveal significant deficiencies in medical spatial understanding, underscoring the necessity and value of the proposed benchmark.
📝 Abstract
Visual spatial intelligence is critical for medical image interpretation, yet it remains largely unexplored in Multimodal Large Language Models (MLLMs) for 3D imaging. This gap persists due to a systemic lack of datasets featuring structured 3D spatial annotations beyond basic labels. In this study, we introduce an agentic pipeline that autonomously synthesizes spatial visual question-answering (VQA) data by orchestrating computational tools, such as volume and distance calculators, through multi-agent collaboration and expert radiologist validation. We present SpatialMed, the first comprehensive benchmark for evaluating 3D spatial intelligence in medical MLLMs, comprising nearly 10K question-answer pairs across multiple organs and tumor types. Our evaluations of 14 state-of-the-art MLLMs and extensive analyses reveal that current models lack robust spatial reasoning capabilities for medical imaging.
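To give a concrete sense of the "volume and distance calculators" the pipeline orchestrates, here is a minimal sketch of how such tools might derive ground-truth answers from binary 3D segmentation masks and known voxel spacing. The function names, signatures, and example data are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative tools for generating 3D spatial VQA ground truth from
# segmentation masks. Assumes binary numpy masks and per-axis voxel
# spacing in millimeters; names are hypothetical, not the paper's API.
import numpy as np


def tumor_volume_mm3(mask: np.ndarray, spacing: tuple) -> float:
    """Volume of a binary 3D mask: voxel count times the volume of one voxel."""
    return float(mask.sum()) * float(np.prod(spacing))


def centroid_distance_mm(mask_a: np.ndarray, mask_b: np.ndarray, spacing: tuple) -> float:
    """Euclidean distance between the centroids of two masks, in millimeters."""
    ca = np.argwhere(mask_a).mean(axis=0) * np.asarray(spacing)
    cb = np.argwhere(mask_b).mean(axis=0) * np.asarray(spacing)
    return float(np.linalg.norm(ca - cb))


# Example: a 10-voxel lesion in a 64^3 volume with 1 x 1 x 2 mm spacing.
volume = np.zeros((64, 64, 64), dtype=bool)
volume[30:32, 30:35, 30] = True
print(tumor_volume_mm3(volume, (1.0, 1.0, 2.0)))  # -> 20.0 mm^3
```

Answers computed this way (e.g., "the lesion measures 20 mm³") could then be paired with templated or agent-generated questions before expert review, which is consistent with the multi-agent synthesis-and-validation design described above.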