Can Multimodal Large Language Models Understand Spatial Relations?

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit significant deficiencies in understanding spatial relations—such as left/right, front/back, and above/below—in real-world images. Existing benchmarks rely heavily on bounding boxes, ignore viewpoint variations, or inadvertently leak prior knowledge, compromising ecological validity. To address this, we propose SpatialMQA, the first benchmark explicitly designed for viewpoint-aware spatial relation reasoning in natural images. It comprises 5,392 high-quality, human-annotated triplets derived from COCO2017. Our novel multi-stage collaborative annotation protocol enforces viewpoint consistency constraints, ensuring evaluations are bounding-box-free, free of prior knowledge leakage, and strictly require viewpoint-aware reasoning. Empirical results show that state-of-the-art MLLMs achieve only 48.14% accuracy—substantially below human performance (98.40%)—confirming spatial reasoning remains a critical bottleneck. The benchmark dataset and code are publicly released.

📝 Abstract
Spatial relation reasoning is a crucial task for multimodal large language models (MLLMs) to understand the objective world. However, current benchmarks have issues such as relying on bounding boxes, ignoring perspective substitutions, or allowing questions to be answered using only the model's prior knowledge without image understanding. To address these issues, we introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO2017, which enables MLLMs to focus more on understanding images in the objective world. To ensure data quality, we design a well-tailored annotation procedure, resulting in SpatialMQA consisting of 5,392 samples. Based on this benchmark, a series of closed- and open-source MLLMs are evaluated, and the results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%. Extensive experimental analyses are also conducted, suggesting future research directions. The benchmark and code are available at https://github.com/ziyan-xiaoyu/SpatialMQA.git.
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' spatial relation understanding in images
Addressing current benchmarks' reliance on prior knowledge over image comprehension
Introducing SpatialMQA to improve image-focused spatial reasoning in MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SpatialMQA benchmark for spatial reasoning
Uses human-annotated COCO2017 dataset
Evaluates MLLMs' image understanding accuracy
Jingping Liu
ECUST
large language model, knowledge graph
Ziyan Liu
School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
Zhedong Cen
School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
Yan Zhou
School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
Yinan Zou
School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
Weiyan Zhang
School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
Haiyun Jiang
Associate Professor, Shanghai Jiao Tong University
(Multimodal) Large Model, Intelligent Target Recognition, Knowledge Graph
Tong Ruan
East China University of Science and Technology
Clinical NLP, LLM, KG