🤖 AI Summary
Existing multimodal large language models (MLLMs) exhibit weak spatial intelligence, and mainstream benchmarks either focus on qualitative reasoning or rely on indoor datasets, lacking open-world evaluation with verifiable ground truth. Method: We introduce the first open-world multimodal spatial reasoning benchmark, built on synchronized stereo vision, LiDAR, and IMU/GPS data captured from a pedestrian perspective. Leveraging 3D reconstruction and automated synthesis, it generates hierarchical spatial questions spanning qualitative relations to quantitative kinematics. We further propose a verifiable ground-truth annotation framework and diagnostic methods for anomalous scenes and visual occlusions. Contribution/Results: Our evaluation reveals, for the first time, that MLLMs rely heavily on linguistic priors rather than geometric reasoning: their accuracy on quantitative spatial tasks is 42% lower than human baselines, and their indoor performance advantage vanishes entirely in open-world settings.
📝 Abstract
While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence, crucial for robust and grounded AI systems, remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum, from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning. Our benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence.
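To illustrate how metrically precise 3D data enables automatic question synthesis with verifiable ground truth, here is a minimal sketch: given two labeled objects with metric centroids (e.g., fused from LiDAR and stereo reconstruction), a quantitative distance question and its exact answer can be emitted programmatically. The function name, object schema, and question template are illustrative assumptions, not the benchmark's actual pipeline.

```python
import math

def make_distance_question(obj_a: dict, obj_b: dict) -> dict:
    """Emit a quantitative spatial QA pair from two objects with
    metric 3D centroids. Hypothetical schema for illustration only."""
    (xa, ya, za) = obj_a["centroid"]
    (xb, yb, zb) = obj_b["centroid"]
    # Euclidean distance in meters; verifiable against sensor ground truth.
    dist = math.sqrt((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2)
    question = (
        f"How far is the {obj_a['label']} from the {obj_b['label']}, in meters?"
    )
    return {"question": question, "answer_m": round(dist, 1)}

qa = make_distance_question(
    {"label": "fire hydrant", "centroid": (2.0, 0.0, 5.0)},
    {"label": "bicycle", "centroid": (5.0, 0.0, 9.0)},
)
# answer_m here is exactly 5.0 m for these example centroids.
```

The same pattern extends to relational questions (comparing signed offsets along camera axes) and kinematic ones (differencing centroids across timestamped frames), which is presumably how a hierarchy of question types can be synthesized from one reconstruction.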