From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs

📅 2025-12-22
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing multimodal large language models (MLLMs) exhibit weak spatial intelligence, and mainstream benchmarks either focus on qualitative reasoning or rely on indoor datasets, lacking open-world evaluation with verifiable ground truth. Method: We introduce the first open-world multimodal spatial reasoning benchmark, built upon synchronized stereo vision, LiDAR, and IMU/GPS data captured from a pedestrian perspective. Leveraging 3D reconstruction and automated synthesis, it generates hierarchical spatial questions spanning qualitative relations to quantitative kinematics. We further propose a verifiable ground-truth annotation framework and diagnostic methods for anomalous scenes and visual occlusions. Contribution/Results: Our evaluation reveals, for the first time, that MLLMs heavily rely on linguistic priors rather than geometric reasoning: their accuracy on quantitative spatial tasks is 42% lower than human baselines, and their indoor performance advantage vanishes entirely in open-world settings.
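
As a rough illustration of the automated synthesis step described in the summary, the sketch below shows how a quantitative distance question with a verifiable answer could be derived from metric 3D ground truth. The `Object3D` representation, the question template, and the tolerance band are assumptions for illustration, not the paper's actual pipeline.

```python
import math
from dataclasses import dataclass

@dataclass
class Object3D:
    # Hypothetical annotation: an object label and its LiDAR-derived
    # position in a metric world frame (meters).
    label: str
    x: float
    y: float
    z: float

def euclidean_distance(a: Object3D, b: Object3D) -> float:
    """Metric distance between two annotated objects."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))

def make_distance_question(a: Object3D, b: Object3D, tolerance: float = 0.5):
    """Generate a quantitative spatial QA pair with verifiable ground truth.

    The template and the +/- tolerance band are illustrative choices;
    the benchmark's real templates and scoring rules may differ.
    """
    gt = euclidean_distance(a, b)
    question = (
        f"Approximately how far apart are the {a.label} "
        f"and the {b.label}, in meters?"
    )
    return {
        "question": question,
        "answer_m": round(gt, 2),
        # An answer counts as correct if it falls inside the tolerance band.
        "accept_range_m": (gt - tolerance, gt + tolerance),
    }

# Example: two objects reconstructed from synchronized stereo + LiDAR.
car = Object3D("parked car", 3.2, 0.0, 12.5)
hydrant = Object3D("fire hydrant", -1.1, 0.0, 8.3)
print(make_distance_question(car, hydrant))
```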

๐Ÿ“ Abstract
While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence--crucial for robust and grounded AI systems--remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum--from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning. Our benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence.
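
To make the abstract's claim about "metrically precise 3D information" concrete, the snippet below shows the standard rectified-stereo relation a synchronized stereo rig can use to recover metric depth. The focal length and baseline are placeholder values, not the paper's calibration.

```python
def depth_from_disparity(disparity_px: float,
                         focal_px: float = 1200.0,
                         baseline_m: float = 0.12) -> float:
    """Classic rectified-stereo depth: Z = f * B / d.

    disparity_px -- pixel offset of a point between left/right images
    focal_px     -- focal length in pixels (placeholder calibration)
    baseline_m   -- distance between the two cameras in meters (placeholder)
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# On this hypothetical rig, a point with 18 px disparity sits about 8 m away.
print(round(depth_from_disparity(18.0), 2))  # 8.0
```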
Problem

Research questions and friction points this paper is trying to address.

MLLMs lack robust spatial intelligence for open-world applications
Existing benchmarks are limited to simplified qualitative reasoning or domain-specific indoor data
Current models rely on linguistic priors over grounded visual reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale outdoor benchmark from pedestrian-perspective videos
Automatic generation of hierarchical spatial reasoning questions
Synthetic abnormal scenes and blinding tests for analysis (see the blinding-test sketch after this list)
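
A minimal sketch of the blinding-test idea from the last bullet, under stated assumptions: query a model once with the image and once with the question text alone; if accuracy barely drops, the answers are driven by linguistic priors rather than the pixels. The `ask_model` callable and the dataset item layout are stand-ins for whatever MLLM API and benchmark format are under evaluation.

```python
from typing import Callable, Optional

def blinding_gap(
    ask_model: Callable[[str, Optional[bytes]], str],
    dataset: list[dict],  # each item: {"question", "image", "answer"}
) -> float:
    """Accuracy drop when the image is withheld (the 'blinding' test).

    A gap near zero suggests the model answers from linguistic priors;
    a large gap suggests it actually uses the visual evidence.
    """
    def accuracy(blind: bool) -> float:
        correct = 0
        for item in dataset:
            image = None if blind else item["image"]
            prediction = ask_model(item["question"], image)
            correct += prediction.strip() == item["answer"]
        return correct / len(dataset)

    return accuracy(blind=False) - accuracy(blind=True)
```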