🤖 AI Summary
Current vision-language models exhibit limited spatial reasoning in large-scale 3D environments (e.g., multi-floor houses), hindering efficient visual question answering (VQA). To address this, we propose SpatialReasoner, an active perception framework for house-level scene understanding built on a novel text-driven, tool-calling, hierarchical-exploration paradigm. We construct H²U3D, the first benchmark supporting multi-floor, large-scale 3D VQA, and design a coarse-to-fine active exploration strategy that drastically reduces the number of images acquired. Hierarchical visual representations are annotated automatically and leveraged within a reinforcement learning framework that combines supervised cold-start initialization with adaptive exploration rewards, enabling chain-of-thought answer generation. On H²U3D, SpatialReasoner achieves state-of-the-art performance while requiring only 3–4 images on average at inference, substantially outperforming prior methods that need 16 or more images as well as strong baselines including GPT-4o and Gemini-2.5-Pro.
📝 Abstract
Spatial reasoning in large-scale 3D environments remains challenging for current vision-language models, which are typically constrained to room-scale scenarios. We introduce H$^2$U3D (Holistic House Understanding in 3D), a 3D visual question answering dataset designed for house-scale scene understanding. H$^2$U3D features multi-floor environments spanning up to three floors and 10-20 rooms, covering more than 300 m$^2$. Through an automated annotation pipeline, it constructs hierarchical coarse-to-fine visual representations and generates diverse question-answer pairs with chain-of-thought annotations. We further propose SpatialReasoner, an active perception framework that autonomously invokes spatial tools to explore 3D scenes based on textual queries. SpatialReasoner is trained through a two-stage strategy: a supervised cold start followed by reinforcement learning with an adaptive exploration reward that promotes efficient exploration while discouraging redundant operations. Extensive experiments demonstrate that SpatialReasoner achieves state-of-the-art performance on H$^2$U3D, outperforming strong baselines including GPT-4o and Gemini-2.5-Pro. Notably, our method attains superior results while using only 3-4 images in total on average, compared to baselines requiring 16+ images, highlighting the effectiveness of our coarse-to-fine active exploration paradigm.
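The abstract describes an adaptive exploration reward that promotes efficient exploration while discouraging redundant tool operations. The paper's exact formulation is not given here, but a minimal sketch of the idea could look like the following, where the function name, the redundancy measure, the call budget, and the penalty weight are all hypothetical illustrative choices, not the authors' implementation:

```python
def adaptive_exploration_reward(correct: bool,
                                tool_calls: list[str],
                                max_useful_calls: int = 4,
                                penalty_weight: float = 0.1) -> float:
    """Combine a task reward with an exploration-efficiency penalty.

    Hypothetical sketch: reward a correct final answer, then subtract a
    penalty for redundant operations, here modeled as (a) repeated
    identical tool calls and (b) calls beyond a small budget.
    """
    task_reward = 1.0 if correct else 0.0
    repeats = len(tool_calls) - len(set(tool_calls))          # duplicate calls
    over_budget = max(0, len(tool_calls) - max_useful_calls)  # excess calls
    penalty = penalty_weight * (repeats + over_budget)
    return task_reward - penalty
```

Under this toy formulation, an agent that answers correctly with a short, non-repetitive tool-call trace keeps the full reward, while one that re-queries the same view or explores far beyond the budget sees its reward shrink, which is the incentive structure the abstract attributes to the adaptive exploration reward.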