Grounded 3D-Aware Spatial Vision-Language Modeling

📅 2026-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models struggle with complex spatial reasoning, particularly with modeling 2D and 3D spatial relationships. This work proposes GR3D, a unified framework that combines explicit and implicit 2D grounding with monocular 3D grounding, enabling a spatial chain-of-thought that proceeds from 2D perception to 3D reasoning. GR3D treats grounding as an inductive bias and incorporates region-guided generation, region token insertion, camera-intrinsic-aware normalization, dense geometric supervision, and multimodal joint training to strengthen spatial understanding. Experiments show consistent gains across multiple spatial reasoning benchmarks, both with and without explicit grounding annotations, indicating that grounding capability supports general-purpose spatial reasoning.
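To make "camera-intrinsic-aware normalization" concrete, here is a minimal sketch of the standard pinhole-camera relation it presumably builds on. The summary does not specify GR3D's exact normalization, so everything here is an assumption for illustration only: the function names `normalize_pixels` and `backproject`, the example intrinsics, and the treatment of depth as distance along the optical axis.

```python
import numpy as np

def normalize_pixels(uv, K):
    """Map pixel coordinates to intrinsics-normalized image coordinates.

    uv: (N, 2) array of pixel coordinates.
    K:  (3, 3) pinhole intrinsic matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (uv[:, 0] - cx) / fx
    y = (uv[:, 1] - cy) / fy
    return np.stack([x, y], axis=1)

def backproject(uv, depth, K):
    """Lift pixels with depth along the optical axis to camera-frame 3D points."""
    xy = normalize_pixels(uv, K)
    return np.concatenate([xy * depth[:, None], depth[:, None]], axis=1)

# Hypothetical intrinsics for a 640x480 image (not taken from the paper).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Center of a hypothetical 2D box, 100 px right of the principal point, 2 m away.
box_center = np.array([[420.0, 240.0]])
point = backproject(box_center, np.array([2.0]), K)
print(point)
```

Dividing by the focal length removes the dependence on a particular camera, so the same normalized coordinate corresponds to the same viewing ray across images taken with different intrinsics. That is one plausible reason a monocular 3D grounding head would want such a normalization.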
📝 Abstract
We present GR3D, a spatial vision-language model equipped with three complementary grounding capabilities (explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding) within a single framework. GR3D introduces an implicit grounding mechanism that identifies entity mentions during generation and inserts the corresponding region tokens into the text stream, allowing the model to reference visual evidence on the fly when producing spatial chain-of-thought responses. In parallel, a region-prompted monocular 3D grounding design predicts 3D bounding boxes in the camera view from grounded region queries, supported by intrinsic-aware normalization and dense geometric supervision. Together, these grounding capabilities enable GR3D to decompose complex spatial understanding problems into grounded 2D perception followed by 3D inference. GR3D achieves consistent improvements across grounded and non-grounded spatial benchmarks, demonstrating grounding as an effective inductive bias for strengthening spatial understanding in VLMs. These grounding capabilities collectively enhance general spatial understanding beyond the grounding task itself.
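The implicit grounding idea described above (detect an entity mention, then splice a region token into the text stream) can be illustrated with a toy sketch. This is not GR3D's implementation: the abstract says insertion happens during generation, whereas this example post-processes a finished string. The `<region_k>` token format, the `insert_region_tokens` helper, and the entity-to-region mapping are all hypothetical.

```python
import re

def insert_region_tokens(text, entity_to_region):
    """Append a region placeholder after each known entity mention.

    entity_to_region maps an entity phrase to a region index; both the
    mapping and the <region_k> token format are hypothetical.
    """
    if not entity_to_region:
        return text
    # Try longer phrases first so "coffee table" wins over "table".
    phrases = sorted(entity_to_region, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b"
    )

    def tag(match):
        phrase = match.group(1)
        return f"{phrase} <region_{entity_to_region[phrase]}>"

    return pattern.sub(tag, text)

# Hypothetical regions produced by a 2D grounding step.
regions = {"mug": 0, "laptop": 1}
out = insert_region_tokens("The mug is to the left of the laptop.", regions)
print(out)
# prints: The mug <region_0> is to the left of the laptop <region_1>.
```

In a real model the placeholder would be replaced by visual features pooled from the grounded region, so later reasoning steps can attend to that evidence directly instead of relying on the entity name alone.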
Problem

Research questions and friction points this paper is trying to address.

spatial understanding
vision-language modeling
3D grounding
2D grounding
spatial reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial vision-language modeling
implicit grounding
monocular 3D grounding
region prompting
geometric supervision