Thinking in Structures: Evaluating Spatial Intelligence through Reasoning on Constrained Manifolds

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key limitation of current vision-language models (VLMs) in spatial reasoning: they often rely on unconstrained 2D cues and fail to capture genuine understanding of the geometric, topological, and physical constraints of the physical world. To this end, the authors introduce SSI-Bench, the first benchmark for constrained-manifold spatial reasoning, comprising 1,000 fully human-designed 3D structural ranking problems that exercise tasks such as mental rotation, cross-section inference, and occlusion reasoning. The benchmark eliminates pixel-level shortcuts through structured component decomposition and expert human annotation. Experiments on 31 prominent VLMs reveal that even the best open-source and closed-source models achieve only 22.2% and 33.6% accuracy, respectively, substantially below the human performance of 91.6%, highlighting a significant gap in structured 3D spatial reasoning capabilities.

📝 Abstract
Spatial intelligence is crucial for vision-language models (VLMs) in the physical world, yet many benchmarks evaluate largely unconstrained scenes where models can exploit 2D shortcuts. We introduce SSI-Bench, a VQA benchmark for spatial reasoning on constrained manifolds, built from complex real-world 3D structures whose feasible configurations are tightly governed by geometric, topological, and physical constraints. SSI-Bench contains 1,000 ranking questions spanning geometric and topological reasoning and requiring a diverse repertoire of compositional spatial operations, such as mental rotation, cross-sectional inference, occlusion reasoning, and force-path reasoning. It is created via a fully human-centered pipeline: ten researchers spent over 400 hours curating images, annotating structural components, and designing questions to minimize pixel-level cues. Evaluating 31 widely used VLMs reveals a large gap to humans: the best open-source model achieves 22.2% accuracy and the strongest closed-source model reaches 33.6%, while humans score 91.6%. Encouraging models to think yields only marginal gains, and error analysis points to failures in structural grounding and constraint-consistent 3D reasoning. Project page: https://ssi-bench.github.io.
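The paper does not specify its scoring protocol, but since SSI-Bench questions are rankings of structural components, a natural scoring rule is exact-match accuracy over the predicted ordering. The sketch below illustrates this assumed metric; the function name and answer format are hypothetical, not taken from the benchmark's released code.

```python
# Hypothetical sketch of exact-match accuracy on ranking questions.
# Assumption (not from the paper): each answer is an ordered list of
# component labels, and a prediction is correct only if the whole
# ordering matches the gold ordering exactly.

def exact_match_accuracy(predictions, gold_orderings):
    """Fraction of questions whose predicted ordering equals the gold one."""
    correct = sum(
        tuple(pred) == tuple(gold)
        for pred, gold in zip(predictions, gold_orderings)
    )
    return correct / len(gold_orderings)

# Toy example: three questions ranking components A, B, C.
preds = [["A", "B", "C"], ["C", "A", "B"], ["B", "A", "C"]]
gold  = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]
print(exact_match_accuracy(preds, gold))  # 2 of 3 correct
```

Under this all-or-nothing rule a model gets no partial credit for a nearly correct ordering, which is one plausible reason reported accuracies are low relative to chance on multi-element rankings.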
Problem

Research questions and friction points this paper is trying to address.

spatial intelligence
constrained manifolds
vision-language models
spatial reasoning
3D reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial intelligence
constrained manifolds
3D reasoning
vision-language models
structural grounding
Authors
Chen Yang (Tsinghua University)
Guanxin Lin (Tsinghua University)
Youquan He (Tsinghua University)
Peiyao Chen (Tsinghua University)
Guanghe Liu (Tsinghua University)
Yufan Mo (Tsinghua University)
Zhouyuan Xu (Tsinghua University)
Linhao Wang (Tsinghua University)
Guohui Zhang (Professor of Civil Engineering, University of Hawaii)
Zihang Zhang (Tsinghua University)
Shenxiang Zeng (Tsinghua University)
Chen Wang (East China Normal University, Tsinghua University, The Hebrew University of Jerusalem)
Jiansheng Fan (Tsinghua University)