🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit significant deficiencies in understanding spatial relations—such as left/right, front/back, and above/below—in real-world images. Existing benchmarks rely heavily on bounding boxes, ignore viewpoint variations, or inadvertently leak prior knowledge, compromising ecological validity. To address this, we propose SpatialMQA, the first benchmark explicitly designed for viewpoint-aware spatial relation reasoning in natural images. It comprises 5,392 high-quality, human-annotated triplets derived from COCO2017. Our novel multi-stage collaborative annotation protocol enforces viewpoint consistency constraints, ensuring evaluations are bounding-box-free, free of prior knowledge leakage, and strictly require viewpoint-aware reasoning. Empirical results show that state-of-the-art MLLMs achieve only 48.14% accuracy—substantially below human performance (98.40%)—confirming spatial reasoning remains a critical bottleneck. The benchmark dataset and code are publicly released.
📝 Abstract
Spatial relation reasoning is a crucial task for multimodal large language models (MLLMs) to understand the objective world. However, current benchmarks have issues such as relying on bounding boxes, ignoring viewpoint variations, or allowing questions to be answered using only the model's prior knowledge, without any image understanding. To address these issues, we introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO2017, which pushes MLLMs to focus on understanding images of the objective world. To ensure data quality, we design a well-tailored annotation procedure, yielding a benchmark of 5,392 samples. On this benchmark, we evaluate a series of closed- and open-source MLLMs; the results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%. We also conduct extensive experimental analyses that suggest directions for future research. The benchmark and code are available at https://github.com/ziyan-xiaoyu/SpatialMQA.git.