SceneGraphGrounder: Zero-Shot 3D Visual Grounding via Structured Scene Graph Matching

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses weak spatial consistency and limited interpretability for compositional queries in zero-shot 3D visual grounding within unstructured environments. The authors reframe the task as structured scene graph matching: visual marker prompts guide a vision-language model (VLM) to infer object-object relationships from 2D views, and these relationships are lifted into a persistent 3D scene graph that encodes both spatial and semantic relations. The scene graph is then aligned, under constraints, with a query graph built from the natural language query. Using only RGB-D inputs, the method achieves competitive performance among zero-shot approaches on the ScanRefer benchmark and demonstrates robust spatial reasoning in long-horizon deployment on a real mobile robot.

📝 Abstract

Zero-shot 3D visual grounding requires localizing objects in unstructured environments from free-form natural language. Recent vision-language model (VLM) approaches achieve promising results but rely on view-dependent reasoning or implicit representations, limiting spatial consistency and interpretability for compositional queries. We propose SceneGraphGrounder, a framework that reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph. To enable this formulation, we introduce a visual marker prompting strategy that enables a VLM to infer object-object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph encoding both spatial and semantic relations. Given a query, we construct a query graph and perform constrained alignment with the scene graph, ensuring multi-view consistency and interpretable reasoning. Experiments on the ScanRefer benchmark demonstrate that our method achieves competitive performance among zero-shot approaches, using only RGB-D inputs. We further validate our framework through real-world deployment on a mobile robot, demonstrating robust spatial reasoning in long-horizon physical environments. We will make our code publicly available upon acceptance.
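The constrained alignment between a query graph and a scene graph can be illustrated with a small sketch. This is not the authors' implementation (their code has not been released); it is a minimal brute-force example under stated assumptions. The graph encoding (dictionaries of category labels plus relation triples), the node names, the relation vocabulary, and the function `match_query` are all hypothetical and chosen only to show what "matching under constraints" can mean.

```python
from itertools import permutations

# Hypothetical scene graph: node id -> category label, plus directed relation triples.
scene_nodes = {0: "chair", 1: "table", 2: "chair", 3: "window"}
scene_edges = {(0, "next to", 1), (2, "next to", 3), (1, "under", 3)}

# Hypothetical query graph for "the chair next to the table".
# The "target" node marks the object the query refers to.
query_nodes = {"target": "chair", "anchor": "table"}
query_edges = {("target", "next to", "anchor")}


def match_query(query_nodes, query_edges, scene_nodes, scene_edges):
    """Return every injective assignment of query nodes to scene nodes
    that preserves category labels and all query relations."""
    q_ids = list(query_nodes)
    matches = []
    # Brute force over ordered selections of distinct scene nodes (assumption:
    # graphs are small; a real system would prune or use a dedicated matcher).
    for candidate in permutations(scene_nodes, len(q_ids)):
        assignment = dict(zip(q_ids, candidate))
        # Node constraint: categories must agree.
        if any(query_nodes[q] != scene_nodes[s] for q, s in assignment.items()):
            continue
        # Edge constraint: each query relation must exist in the scene graph.
        if all((assignment[a], rel, assignment[b]) in scene_edges
               for a, rel, b in query_edges):
            matches.append(assignment)
    return matches


matches = match_query(query_nodes, query_edges, scene_nodes, scene_edges)
print(matches)
```

In this toy scene, both chairs satisfy the category constraint, but only chair 0 has a "next to" relation with the table, so the relation constraint disambiguates the referent. Each returned assignment also serves as an explanation of the result, which is the kind of interpretability the abstract attributes to explicit graph matching.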

Problem

Research questions and friction points this paper is trying to address.

zero-shot 3D visual grounding

structured scene graph

compositional queries

spatial consistency

visual grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

scene graph matching

zero-shot 3D visual grounding

visual marker prompting