🤖 AI Summary
Multimodal large language models (MLLMs) exhibit significant limitations in spatial understanding, yet existing evaluations remain fragmented and lack systematicity. To address this, we propose MulSeT—the first benchmark systematically assessing MLLMs’ spatial reasoning across single-view, multi-view, and video modalities, from both data-scale and architectural perspectives. Through multi-task evaluation, training-data scaling, and ablation studies, we identify positional encoding within the visual encoder as the primary bottleneck for spatial comprehension—its impact substantially outweighing that of the language model component; merely scaling training data fails to overcome inherent performance ceilings. Our key contributions are: (1) introducing MulSeT, the first comprehensive spatial understanding benchmark for multi-view and video inputs; (2) establishing the dominant role of visual positional encoding in spatial reasoning; and (3) empirically validating the efficacy of inference-time enhancements (e.g., reasoning injection), thereby providing both theoretical insights and practical guidance for advancing MLLMs’ spatial capabilities.
📝 Abstract
Spatial understanding is essential for Multimodal Large Language Models (MLLMs) to support perception, reasoning, and planning in embodied environments. Despite recent progress, existing studies reveal that MLLMs still struggle with spatial understanding. However, existing research lacks a comprehensive and systematic evaluation of these limitations, often restricted to isolated scenarios, such as single-view or video. In this work, we present a systematic analysis of spatial understanding from both data and architectural perspectives across three representative scenarios: single-view, multi-view, and video. We propose a benchmark named MulSeT (Multi-view Spatial Understanding Tasks), and design a series of experiments to analyze the spatial reasoning capabilities of MLLMs. From the data perspective, the performance of spatial understanding converges quickly as the training data increases, and the upper bound is relatively low, especially for tasks that require spatial imagination. This indicates that merely expanding training data is insufficient to achieve satisfactory performance. From the architectural perspective, we find that spatial understanding relies more heavily on the positional encoding within the visual encoder than within the language model, in both cascaded and native MLLMs. Moreover, we explore reasoning injection and envision future improvements through architectural design to optimize spatial understanding. These insights shed light on the limitations of current MLLMs and suggest new directions for improving spatial reasoning capabilities through data scaling and architectural tuning.