🤖 AI Summary
This study addresses two limitations of existing medical vision-language models: they offer little transparency in their spatial reasoning, and they are evaluated mostly on single 2D frames, which fits poorly with the multi-slice, volumetric nature of clinical MRI. To bridge this gap, we introduce SGMRI-VQA, a multi-slice, spatially grounded visual question answering benchmark for MRI comprising 41,307 question-answer pairs spanning four task levels: detection, localization, counting/classification, and captioning. The benchmark is built from expert radiologist annotations in the fastMRI+ dataset, covering brain and knee studies, and each pair carries a chain-of-thought annotation with frame-indexed bounding box coordinates, so that findings are grounded across slices as well as within them. We benchmark ten vision-language models and show that fine-tuning Qwen3-VL-8B with bounding-box supervision consistently improves spatial grounding performance over strong zero-shot baselines, suggesting that targeted spatial supervision is an effective route to clinically interpretable reasoning.
📝 Abstract
Spatial reasoning and visual grounding are core capabilities for vision-language models (VLMs), yet most medical VLMs produce predictions without transparent reasoning or spatial evidence. Existing benchmarks also evaluate VLMs on isolated 2D images, overlooking the volumetric nature of clinical imaging, where findings can span multiple frames or appear on only a few slices. We introduce Spatially Grounded MRI Visual Question Answering (SGMRI-VQA), a 41,307-pair benchmark for multi-frame, spatially grounded reasoning on volumetric MRI. Built from expert radiologist annotations in the fastMRI+ dataset across brain and knee studies, each QA pair includes a clinician-aligned chain-of-thought trace with frame-indexed bounding box coordinates. Tasks are organized hierarchically across detection, localization, counting/classification, and captioning, requiring models to jointly reason about what is present, where it is, and across which frames it extends. We benchmark 10 VLMs and show that supervised fine-tuning of Qwen3-VL-8B with bounding box supervision consistently improves grounding performance over strong zero-shot baselines, indicating that targeted spatial supervision is an effective path toward grounded clinical reasoning.
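To make "frame-indexed bounding box coordinates" concrete, the following is a minimal Python sketch of how one such QA pair might be represented and compared against a model prediction. The record layout, the field names, and the sample values are all hypothetical and are not taken from the benchmark. Likewise, intersection-over-union (IoU) per box and Jaccard overlap of frame indices are common grounding measures used here for illustration; the abstract does not state which metrics SGMRI-VQA uses.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed box convention: (x1, y1, x2, y2) pixel corners with x1 < x2, y1 < y2.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedQA:
    """Hypothetical record layout; field names are illustrative only."""
    question: str
    answer: str
    task: str                     # e.g. "detection" or "localization"
    reasoning: str                # chain-of-thought trace
    boxes: Dict[int, List[Box]]   # frame (slice) index -> boxes on that frame


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def frame_overlap(pred: Dict[int, List[Box]], gold: Dict[int, List[Box]]) -> float:
    """Jaccard overlap between predicted and reference frame-index sets."""
    p, g = set(pred), set(gold)
    return len(p & g) / len(p | g) if (p | g) else 1.0


# Invented example: a finding annotated on two adjacent slices.
gold = GroundedQA(
    question="Is there an abnormality, and on which frames?",
    answer="Yes, on frames 14 and 15.",
    task="localization",
    reasoning="The finding is visible on frame 14 and persists on frame 15.",
    boxes={14: [(0.0, 0.0, 10.0, 10.0)], 15: [(2.0, 2.0, 12.0, 12.0)]},
)
# Invented prediction: right region, but shifted one frame too far.
pred_boxes = {15: [(2.0, 2.0, 12.0, 12.0)], 16: [(5.0, 5.0, 15.0, 15.0)]}

shared_frame_iou = box_iou(pred_boxes[15][0], gold.boxes[15][0])
frames_score = frame_overlap(pred_boxes, gold.boxes)
```

Scoring frames and boxes separately reflects the two questions the tasks pose: where a finding is within a slice, and across which slices it extends. A model can get one right and the other wrong, as in the invented prediction above.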