Enhancing Spatial Reasoning through Visual and Textual Thinking

📅 2025-07-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) exhibit limited capability in spatial reasoning—particularly for 2D/3D relational understanding. To address this, we propose SpatialVTS, the first framework enabling synergistic dual-path (visual and textual) reasoning: it generates position-sensitive tokens guided by visual cues and models spatial logic via conversational, long-context language reasoning—without requiring auxiliary annotations (e.g., masks or depth maps). SpatialVTS autonomously localizes key objects and infers latent spatial associations. We systematically reconstruct multiple benchmark datasets by injecting explicit spatial logical chains and human-verified reasoning traces. Experiments demonstrate that SpatialVTS achieves substantial average performance gains over state-of-the-art methods across diverse spatial understanding tasks—including visual question answering and embodied AI—validating its effectiveness, generalizability, and real-world applicability.

📝 Abstract
The spatial reasoning task aims to reason about spatial relationships in 2D and 3D space, a fundamental capability for Visual Question Answering (VQA) and robotics. Although vision-language models (VLMs) have developed rapidly in recent years, they still struggle with spatial reasoning. In this paper, we introduce a method that enhances Spatial reasoning through Visual and Textual thinking Simultaneously (SpatialVTS). In the spatial visual thinking phase, our model is trained to automatically generate location-related tokens for essential targets: not only the objects mentioned in the question, but also potential objects relevant to the reasoning. In the spatial textual thinking phase, our model conducts long-form reasoning over these visual cues and the dialogue, gradually inferring the answers to spatial reasoning problems. To effectively support training, we manually correct existing spatial reasoning datasets, eliminating numerous incorrect labels produced by automatic annotation, restructuring the input format to improve generalization, and adding thinking processes with detailed logical reasoning. Without introducing additional information (such as masks or depth), our model significantly improves the overall average performance on several spatial understanding tasks compared with other models.
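The two-phase pipeline described in the abstract can be illustrated with a minimal toy sketch. Everything below — the `Detection` type, the `<obj …>` token format, and the left/right comparison logic — is an illustrative assumption, not the paper's implementation:

```python
# Toy sketch of a SpatialVTS-style two-phase inference loop (hypothetical).
# Phase 1 ("visual thinking"): serialize question-mentioned objects, plus other
# detected objects that may matter for reasoning, as location tokens.
# Phase 2 ("textual thinking"): step-by-step spatial inference over those tokens.
from dataclasses import dataclass

@dataclass
class Detection:
    name: str
    box: tuple  # (x1, y1, x2, y2), normalized to [0, 1]

def visual_thinking(question: str, detections: list) -> list:
    """Phase 1: emit location tokens, question-mentioned objects first."""
    q = question.lower()
    mentioned = [d for d in detections if d.name in q]
    related = [d for d in detections if d.name not in q]  # stand-in for learned relevance
    return [f"<obj {d.name} {d.box[0]:.2f} {d.box[1]:.2f} {d.box[2]:.2f} {d.box[3]:.2f}>"
            for d in mentioned + related]

def textual_thinking(question: str, tokens: list) -> str:
    """Phase 2: toy reasoning over the tokens; only handles 'left of' questions
    by comparing horizontal box centers (assumes the first-mentioned object
    is the subject of the question)."""
    objs = {}
    for t in tokens:
        parts = t.strip("<>").split()
        name, coords = parts[1], [float(c) for c in parts[2:]]
        objs[name] = (coords[0] + coords[2]) / 2  # x-center
    q = question.lower()
    names = [n for n in objs if n in q]
    if "left" in q and len(names) >= 2:
        return "yes" if objs[names[0]] < objs[names[1]] else "no"
    return "unknown"

dets = [Detection("cup", (0.10, 0.50, 0.25, 0.70)),
        Detection("lamp", (0.60, 0.20, 0.80, 0.60)),
        Detection("table", (0.05, 0.65, 0.95, 0.95))]
question = "Is the cup to the left of the lamp?"
tokens = visual_thinking(question, dets)   # includes "table" as a related object
answer = textual_thinking(question, tokens)
```

In the real method both phases are produced by the VLM itself (the tokens by generation, the inference by long-form text); the rule-based functions here only make the division of labor between the two phases concrete.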
Problem

Research questions and friction points this paper is trying to address.

Improving spatial reasoning in vision-language models
Enhancing 2D and 3D spatial relationship understanding
Correcting and refining spatial reasoning datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates location-related tokens automatically
Conducts long-term thinking with visual cues
Enhances dataset with manual corrections
Xun Liang
State Key Lab of CAD&CG, Zhejiang University
Xin Guo
Alibaba Cloud Computing
Zhongming Jin
Alibaba DAMO Academy
Machine Learning · Computer Vision · Information Retrieval
Weihang Pan
School of Software Technology, Zhejiang University
Penghui Shang
Xihu, Hangzhou Zhiyuan Research Institute Co., Ltd
Deng Cai
Professor of Computer Science, Zhejiang University
Machine Learning · Computer Vision · Data Mining · Information Retrieval
Binbin Lin
School of Software Technology, Zhejiang University
Jieping Ye
Alibaba Cloud Computing