HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models

📅 2026-03-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limitations of vision-language models (VLMs) in 3D structural inference, object-attribute comprehension, and higher-order spatial reasoning by proposing a hierarchical learning framework that decomposes 3D spatial understanding into four progressive levels: geometric perception, spatial relation modeling, attribute integration, and abstract reasoning. Using an automated pipeline, the authors construct a large-scale RGB-D visual question answering (VQA) dataset from approximately 5 million images and 45 million objects, and further incorporate metric-scale point maps as auxiliary inputs for model fine-tuning. The study establishes the first systematic hierarchical paradigm for 3D spatial understanding, reveals interdependencies among multi-level tasks, and achieves state-of-the-art performance across multiple spatial reasoning benchmarks, outperforming both specialized spatial models and proprietary systems such as Gemini-2.5-Pro and GPT-5.

πŸ“ Abstract
Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex levels, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that processes approximately 5M images with over 45M objects to generate 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised fine-tuning. We also develop an RGB-D VLM incorporating metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large proprietary systems such as Gemini-2.5-pro and GPT-5. Moreover, our analysis reveals clear dependencies among hierarchical task levels, offering new insights into how multi-level task design facilitates the emergence of 3D spatial intelligence.
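The four-level decomposition described in the abstract can be sketched as a simple ordered hierarchy. The level names follow the abstract, but the code itself (class and function names included) is an illustrative assumption, not the paper's implementation:

```python
from enum import IntEnum

class SpatialLevel(IntEnum):
    """Hypothetical encoding of the paper's four progressively complex
    levels of 3D spatial understanding, low to high."""
    GEOMETRIC_PERCEPTION = 1   # inferring 3D structure from 2D observations
    SPATIAL_RELATION = 2       # modeling relations between objects in 3D space
    ATTRIBUTE_INTEGRATION = 3  # combining object properties with spatial layout
    ABSTRACT_REASONING = 4     # high-level spatial reasoning

def prerequisites(level: SpatialLevel) -> list[SpatialLevel]:
    """Return all lower levels a task at `level` builds on, reflecting the
    abstract's claim of clear dependencies among hierarchical task levels."""
    return [lower for lower in SpatialLevel if lower < level]

# Example: abstract reasoning presupposes all three lower levels.
print([l.name for l in prerequisites(SpatialLevel.ABSTRACT_REASONING)])
```

Under this sketch, a curriculum or analysis pipeline could group VQA pairs by level and check that performance at one level tracks mastery of its prerequisites.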
Problem

Research questions and friction points this paper is trying to address.

3D spatial understanding
vision-language models
spatial reasoning
hierarchical spatial intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical spatial understanding
3D visual-language models
RGB-D VLM
spatial reasoning
automated VQA generation