3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models

📅 2026-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the persistent deficiency of vision-language models in fundamental spatial tasks—such as block counting—stemming from their inability to construct consistent 3D mental representations from 2D images, a gap termed the “spatial intelligence gap.” To bridge this, the authors propose 3ViewSense, a novel framework that, for the first time, integrates the orthographic projection principles from engineering drawing into vision-language modeling. By introducing a “simulate-and-reason” mechanism, 3ViewSense decomposes complex scenes into canonical orthographic views, explicitly models mental rotation, and aligns egocentric and allocentric reference frames to establish a view-consistent spatial reasoning interface. The method significantly outperforms existing models across multiple spatial reasoning benchmarks, demonstrating robust performance particularly in heavily occluded counting and view-consistency tasks, while also enhancing the coherence and stability of spatial descriptions.
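The orthographic decomposition the summary describes can be illustrated with a toy sketch. This is a hypothetical example, not the authors' code: a block scene is modeled as a set of unit-cube coordinates and collapsed onto the three canonical views of engineering drawing (front, side, top), which also shows why a single view undercounts occluded blocks.

```python
# Hypothetical sketch (not the paper's implementation): represent a block
# scene as a set of unit-cube coordinates and project it onto the three
# canonical orthographic views used in engineering drawing.

scene = {(0, 0, 0), (1, 0, 0), (0, 0, 1)}  # axes: x = right, y = depth, z = up

def orthographic_views(blocks):
    front = {(x, z) for x, y, z in blocks}  # viewed along the depth axis
    side = {(y, z) for x, y, z in blocks}   # viewed along the width axis
    top = {(x, y) for x, y, z in blocks}    # viewed from above
    return front, side, top

front, side, top = orthographic_views(scene)
# The block stacked at (0, 0, 1) shares a footprint with (0, 0, 0),
# so the top view alone undercounts the scene.
print(len(scene), len(top))  # 3 blocks, but only 2 footprints from above
```

Here the front view preserves all three blocks while the top view collapses the stacked pair, which is exactly the kind of occlusion ambiguity that reasoning over multiple canonical views resolves.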

📝 Abstract
Current Large Language Models have achieved Olympiad-level logic, yet Vision-Language Models paradoxically falter on elementary spatial tasks like block counting. This capability mismatch reveals a critical "spatial intelligence gap," where models fail to construct coherent 3D mental representations from 2D observations. We uncover this gap via diagnostic analyses showing the bottleneck is a missing view-consistent spatial interface rather than insufficient visual features or weak reasoning. To bridge this, we introduce 3ViewSense, a framework that grounds spatial reasoning in Orthographic Views. Drawing on engineering cognition, we propose a "Simulate-and-Reason" mechanism that decomposes complex scenes into canonical orthographic projections to resolve geometric ambiguities. By aligning egocentric perceptions with these allocentric references, our method facilitates explicit mental rotation and reconstruction. Empirical results on spatial reasoning benchmarks demonstrate that our method significantly outperforms existing baselines, with consistent gains on occlusion-heavy counting and view-consistent spatial reasoning. The framework also improves the stability and consistency of spatial descriptions, offering a scalable path toward stronger spatial intelligence in multimodal systems.
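The "explicit mental rotation and reconstruction" step of the abstract can be made concrete with a small, assumed example (not taken from the paper): given three orthographic silhouettes, the largest arrangement consistent with all of them fills every cell whose three projections appear in the views, which bounds the block count from above.

```python
# Illustrative sketch, not the authors' algorithm: reconstruct the maximal
# block arrangement consistent with three orthographic silhouettes. A cell
# (x, y, z) may be occupied only if its projection appears in every view.

def max_consistent_scene(front, side, top, n=3):
    return {(x, y, z)
            for x in range(n) for y in range(n) for z in range(n)
            if (x, z) in front and (y, z) in side and (x, y) in top}

# Silhouettes of a 3-block scene: two side-by-side cubes, one stacked on top.
front = {(0, 0), (1, 0), (0, 1)}  # (x, z) pairs
side = {(0, 0), (0, 1)}           # (y, z) pairs
top = {(0, 0), (1, 0)}            # (x, y) pairs

candidates = max_consistent_scene(front, side, top)
print(sorted(candidates))  # [(0, 0, 0), (0, 0, 1), (1, 0, 0)]
```

For these views the consistency constraint pins down the scene exactly; in general it yields an upper bound, and comparing it against what any single view shows brackets the true count under occlusion.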
Problem

Research questions and friction points this paper is trying to address.

spatial intelligence gap
3D mental representation
orthographic views
spatial reasoning
vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

orthographic views
spatial reasoning
mental rotation
vision-language models
3D reconstruction
Authors
Shaoxiong Zhan - Tsinghua University (Natural Language Processing; Large Language Model)
Yanlin Lai - Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Zheng Liu - Wuhan University, China (Single-Molecule Biophysics; Mechanobiology)
Hai Lin - Electrical Engineering, University of Notre Dame (Cyber-Physical Systems; Hybrid Dynamical Systems; Distributed Cooperative Systems)
Shen Li - School of Software Engineering, Chongqing University, Chongqing, China
Xiaodong Cai - Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Zijian Lin - Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Wen Huang - Tsinghua University (Generative model)
Hai-Tao Zheng - Shenzhen International Graduate School, Tsinghua University, Shenzhen, China