Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately evaluate vision-language models' (VLMs') core human spatial cognition capabilities, particularly intrinsic dynamic spatial reasoning. To address this, we propose Spatial-DISE, a cognition-driven, unified benchmark built on the first four-quadrant spatial reasoning taxonomy: intrinsic static, intrinsic dynamic, extrinsic static, and extrinsic dynamic. Leveraging an automated generation pipeline, we concurrently construct a high-quality evaluation set and a large-scale training corpus. Comprehensive evaluation across 28 state-of-the-art VLMs reveals substantial performance gaps relative to human-level competence, especially on multi-step, multi-view dynamic reasoning tasks. Spatial-DISE not only identifies critical bottlenecks in current VLMs' spatial intelligence but also provides the first scalable, verifiable, and multidimensional resource for both spatial reasoning evaluation and training. By bridging cognitive theory with machine learning practice, it establishes a foundational platform for advancing human-like spatial intelligence research.

📝 Abstract
Spatial reasoning ability is crucial for Vision Language Models (VLMs) to support real-world applications in diverse domains including robotics, augmented reality, and autonomous navigation. Unfortunately, existing benchmarks are inadequate in assessing spatial reasoning ability, especially intrinsic-dynamic spatial reasoning, which is a fundamental aspect of human spatial cognition. In this paper, we propose a unified benchmark, Spatial-DISE, based on a cognitively grounded taxonomy that categorizes tasks into four fundamental quadrants: Intrinsic-Static, Intrinsic-Dynamic, Extrinsic-Static, and Extrinsic-Dynamic spatial reasoning. Moreover, to address the issue of data scarcity, we develop a scalable and automated pipeline to generate diverse and verifiable spatial reasoning questions, resulting in a new Spatial-DISE dataset that includes Spatial-DISE Bench (559 evaluation VQA pairs) and Spatial-DISE-12K (12K+ training VQA pairs). Our comprehensive evaluation across 28 state-of-the-art VLMs reveals that current VLMs have a large and consistent gap to human competence, especially on multi-step, multi-view spatial reasoning. Spatial-DISE offers a robust framework, valuable dataset, and clear direction for future research toward human-like spatial intelligence. Benchmark, dataset, and code will be publicly released.
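The taxonomy crosses two axes — reference frame (intrinsic vs. extrinsic) and motion (static vs. dynamic) — yielding the four quadrants that spell the benchmark's name. A minimal illustrative sketch of that 2×2 structure (all names are hypothetical, not from the paper's released code):

```python
from enum import Enum

class Frame(Enum):
    INTRINSIC = "intrinsic"   # object-centered reference frame
    EXTRINSIC = "extrinsic"   # environment- or viewer-centered frame

class Motion(Enum):
    STATIC = "static"         # reasoning over a fixed spatial configuration
    DYNAMIC = "dynamic"       # reasoning over movement or transformation

def quadrant(frame: Frame, motion: Motion) -> str:
    """Label a task by its (frame, motion) pair, e.g. 'intrinsic-dynamic'."""
    return f"{frame.value}-{motion.value}"

# Enumerate the four quadrants of the taxonomy (D, I, S, E).
quadrants = [quadrant(f, m) for f in Frame for m in Motion]
```

Each benchmark question would then carry one such quadrant label, which is what allows per-quadrant performance gaps (e.g. the intrinsic-dynamic weakness the paper highlights) to be measured.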
Problem

Research questions and friction points this paper is trying to address.

Evaluating spatial reasoning in vision-language models
Addressing inadequate benchmarks for dynamic spatial reasoning
Generating scalable datasets for multi-step spatial cognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed unified benchmark for spatial reasoning evaluation
Created automated pipeline for generating spatial reasoning questions
Produced comprehensive dataset with 12K+ training pairs
Xinmiao Huang
Department of Computer Science, University of Liverpool
Qisong He
Department of Computer Science, University of Liverpool
Zhenglin Huang
Department of Computer Science, University of Liverpool
Boxuan Wang
Department of Computer Science, University of Liverpool
Zhuoyun Li
Department of Computer Science, University of Liverpool
Guangliang Cheng
Reader (Associate Professor), University of Liverpool
Computer Vision, Deepfake Detection, Autonomous Driving, Robotics
Yi Dong
Department of Computer Science, University of Liverpool
Xiaowei Huang
Professor of Computer Science, University of Liverpool
AI Safety and Security, Verification, Trustworthy AI, Formal Methods, Explainable AI