AI Summary
This work addresses the lack of systematic evaluation of spatial intelligence (SI) in large vision-language models (VLMs). To this end, we introduce SITE, the first standardized benchmark comprehensively covering single-image, multi-image, and video modalities across scales from graphical to environmental. Grounded in three foundational cognitive science taxonomies, SITE integrates 31 existing datasets and innovatively designs two task categories: viewpoint sampling and dynamic scene reasoning, both implemented via multiple-choice visual question answering. This enables holistic assessment across modalities, scales, static/dynamic conditions, and intrinsic/extrinsic spatial dimensions. Experiments reveal that state-of-the-art VLMs underperform humans by 32.7% on fundamental SI tasks such as spatial orientation; moreover, SI capability strongly correlates with embodied AI performance (r = 0.81). SITE thus provides a quantifiable, interpretable, and extensible evaluation metric for spatial reasoning in VLMs.
Abstract
Spatial intelligence (SI) is a cognitive ability encompassing the visualization and manipulation of, and reasoning about, spatial relationships, and it underpins disciplines from neuroscience to robotics. We introduce SITE, a benchmark dataset towards SI Thorough Evaluation in a standardized multiple-choice visual question-answering format, designed to assess the spatial intelligence of large vision-language models across diverse visual modalities (single-image, multi-image, and video) and SI factors (figural to environmental scales, spatial visualization and orientation, intrinsic and extrinsic, static and dynamic). Our approach to curating the benchmark combines a bottom-up survey of 31 existing datasets with a top-down strategy drawing upon three classification systems in cognitive science, which prompted us to design two novel types of tasks on view-taking and dynamic scenes. Extensive experiments reveal that leading models lag behind human experts, especially in spatial orientation, a fundamental SI factor. Moreover, we demonstrate a positive correlation between a model's spatial reasoning proficiency and its performance on an embodied AI task.
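As a rough illustration of the two quantities the abstract refers to, the sketch below scores multiple-choice answers per SI category and computes a Pearson correlation between two score lists. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the record layout, the category names, the function names `accuracy_by_category` and `pearson_r`, and all toy values are hypothetical and are not taken from SITE.

```python
import math
from collections import defaultdict


def accuracy_by_category(records):
    """Per-category accuracy for multiple-choice answers.

    records: iterable of (category, predicted_choice, correct_choice).
    This tuple layout is an assumption made for illustration only.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, predicted, correct in records:
        totals[category] += 1
        # Compare option letters case-insensitively, ignoring whitespace.
        if predicted.strip().upper() == correct.strip().upper():
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Toy records invented purely for illustration (not SITE data).
toy_records = [
    ("spatial_orientation", "A", "B"),
    ("spatial_orientation", "C", "C"),
    ("spatial_visualization", "d", "D"),
    ("spatial_visualization", "B", "B"),
]
toy_scores = accuracy_by_category(toy_records)
```

In such a setup, `pearson_r` would be called with one benchmark score per model and the matching embodied-task score per model, in the same model order; how the paper actually aggregates scores is not stated in the abstract.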