Can Multimodal Large Language Models Understand Spatial Relations?

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit significant deficiencies in understanding spatial relations—such as left/right, front/back, and above/below—in real-world images. Existing benchmarks rely heavily on bounding boxes, ignore viewpoint variations, or inadvertently leak prior knowledge, compromising ecological validity. To address this, we propose SpatialMQA, the first benchmark explicitly designed for viewpoint-aware spatial relation reasoning in natural images. It comprises 5,392 high-quality, human-annotated triplets derived from COCO2017. Our novel multi-stage collaborative annotation protocol enforces viewpoint consistency constraints, ensuring evaluations are bounding-box-free, free of prior knowledge leakage, and strictly require viewpoint-aware reasoning. Empirical results show that state-of-the-art MLLMs achieve only 48.14% accuracy—substantially below human performance (98.40%)—confirming spatial reasoning remains a critical bottleneck. The benchmark dataset and code are publicly released.

📝 Abstract
Spatial relation reasoning is a crucial task for multimodal large language models (MLLMs) to understand the objective world. However, current benchmarks have issues such as relying on bounding boxes, ignoring perspective substitutions, or allowing questions to be answered using only the model's prior knowledge without image understanding. To address these issues, we introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO2017, which enables MLLMs to focus more on understanding images in the objective world. To ensure data quality, we design a well-tailored annotation procedure, resulting in SpatialMQA consisting of 5,392 samples. Based on this benchmark, a series of closed- and open-source MLLMs are evaluated, and the results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%. Extensive experimental analyses are also conducted, suggesting future research directions. The benchmark and code are available at https://github.com/ziyan-xiaoyu/SpatialMQA.git.
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' spatial relation understanding in images
Addressing current benchmarks' reliance on prior knowledge over image comprehension
Introducing SpatialMQA to improve image-focused spatial reasoning in MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SpatialMQA benchmark for spatial reasoning
Uses human-annotated COCO2017 dataset
Evaluates MLLMs' image understanding accuracy
Jingping Liu
ECUST
large language model, knowledge graph
Ziyan Liu
School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
Zhedong Cen
School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
Yan Zhou
School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
Yinan Zou
School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
Weiyan Zhang
School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
Haiyun Jiang
Associate Professor, Shanghai Jiao Tong University
(Multimodal) Large Model, Intelligent Target Recognition, Knowledge Graph
Tong Ruan
East China University of Science and Technology
Clinical NLP, LLM, KG