🤖 AI Summary
Multimodal large language models (MLLMs) exhibit significant limitations in human-like spatial reasoning—particularly in 3D relational understanding and manipulation. Method: Moving beyond conventional modality-based taxonomies (text/image/3D), this work introduces, for the first time, a cognitively grounded, hierarchical taxonomy of spatial intelligence, systematically categorizing tasks by reasoning complexity and core cognitive functions (e.g., mental rotation, spatial working memory). Leveraging this taxonomy, we conduct unified evaluation across textual, vision-language, and embodied benchmarks. Contribution/Results: Our analysis uncovers fundamental deficits in high-order spatial reasoning across state-of-the-art models. We further propose a dual-path improvement framework: training enhancements (cognitive-aware data augmentation, architecture refinement) and inference strategies (spatial chain-of-thought, tool-augmented reasoning). This work establishes an interpretable theoretical foundation and a clear technical roadmap for cross-modal spatial intelligence research.
📝 Abstract
Spatial reasoning, which requires the ability to perceive and manipulate spatial relationships in the 3D world, is a fundamental aspect of human intelligence, yet it remains a persistent challenge for multimodal large language models (MLLMs). While existing surveys often categorize recent progress based on input modality (e.g., text, image, video, or 3D), we argue that spatial ability is not solely determined by the input format. Instead, our survey introduces a taxonomy that organizes spatial intelligence from a cognitive perspective, dividing tasks by reasoning complexity and linking them to core cognitive functions. We map existing benchmarks across text-only, vision-language, and embodied settings onto this taxonomy, and review evaluation metrics and methodologies for assessing spatial reasoning ability. This cognitive perspective enables more principled cross-task comparisons and reveals critical gaps between current model capabilities and human-like reasoning. In addition, we analyze methods for improving spatial ability, spanning both training-based and reasoning-based approaches. This dual-perspective analysis clarifies their respective strengths and uncovers complementary mechanisms. By surveying tasks, benchmarks, and recent advances, we aim to provide new researchers with a comprehensive understanding of the field and actionable directions for future research.