Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Systematic surveys and open-source evaluation benchmarks for spatial reasoning in multimodal large language models (MLLMs) remain absent. Method: This work introduces the first comprehensive taxonomy of multimodal spatial reasoning tasks, covering 2D/3D scene understanding, spatial relation modeling, and embodied intelligence applications. It proposes MM-SpatialBench, a scalable, modular, open-source evaluation platform that unifies diverse tasks including visual question answering, 3D localization, and navigation. The framework integrates MLLMs with post-training optimization, cross-modal interpretability analysis, and joint reasoning over multimodal sensor data (e.g., vision, audio, egocentric video). Contribution/Results: Experiments demonstrate significant improvements in model generalization and structured spatial reasoning on complex tasks. MM-SpatialBench establishes standardized infrastructure for advancing research on multimodal spatial cognition.

📝 Abstract
Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing promising performance across diverse spatial tasks. However, systematic reviews and publicly available benchmarks for these models remain limited. In this survey, we provide a comprehensive review of multimodal spatial reasoning tasks with large models, categorizing recent progress in multimodal large language models (MLLMs) and introducing open benchmarks for evaluation. We begin by outlining general spatial reasoning, focusing on post-training techniques, explainability, and architecture. Beyond classical 2D tasks, we examine spatial relationship reasoning, scene and layout understanding, as well as visual question answering and grounding in 3D space. We also review advances in embodied AI, including vision-language navigation and action models. Additionally, we consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors. We believe this survey establishes a solid foundation and offers insights into the growing field of multimodal spatial reasoning. Updated information about this survey, along with code and implementations of the open benchmarks, can be found at https://github.com/zhengxuJosh/Awesome-Spatial-Reasoning.
Problem

Research questions and friction points this paper is trying to address.

Reviewing multimodal spatial reasoning tasks with large models
Introducing open benchmarks for model evaluation
Examining spatial understanding through emerging modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Survey categorizes multimodal large language models progress
Introduces open benchmarks for spatial reasoning evaluation
Reviews emerging modalities like audio and egocentric video