View-on-Graph: Zero-shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Zero-shot 3D visual grounding—localizing objects described in natural language within unseen 3D scenes—remains challenging. Existing approaches follow a VLM + SI (Vision-Language Model + Spatial Information) paradigm that projects 3D spatial structure into 2D renderings or marker-annotated videos, entangling visual representations and under-exploiting spatial semantics. Method: The paper proposes VLM × SI, a new paradigm that externalizes 3D spatial information as a multimodal, hierarchical scene graph, instantiated by the View-on-Graph (VoG) method. A vision-language model acts as an active agent, progressively traversing the graph and reasoning incrementally to produce interpretable localization trajectories. The approach integrates scene graph construction, multimodal graph representation learning, and CLIP-driven graph reasoning. Contribution/Results: On a zero-shot 3D visual grounding benchmark, VoG achieves state-of-the-art performance, improving both localization accuracy and reasoning interpretability over prior methods.

📝 Abstract
3D visual grounding (3DVG) identifies objects in 3D scenes from language descriptions. Existing zero-shot approaches leverage 2D vision-language models (VLMs) by converting 3D spatial information (SI) into forms amenable to VLM processing, typically as composite inputs such as specified view renderings or video sequences with overlaid object markers. However, this VLM + SI paradigm yields entangled visual representations that compel the VLM to process entire cluttered cues, making it hard to exploit spatial semantic relationships effectively. In this work, we propose a new VLM x SI paradigm that externalizes the 3D SI into a form enabling the VLM to incrementally retrieve only what it needs during reasoning. We instantiate this paradigm with a novel View-on-Graph (VoG) method, which organizes the scene into a multi-modal, multi-layer scene graph and allows the VLM to operate as an active agent that selectively accesses necessary cues as it traverses the scene. This design offers two intrinsic advantages: (i) by structuring 3D context into a spatially and semantically coherent scene graph rather than confounding the VLM with densely entangled visual inputs, it lowers the VLM's reasoning difficulty; and (ii) by actively exploring and reasoning over the scene graph, it naturally produces transparent, step-by-step traces for interpretable 3DVG. Extensive experiments show that VoG achieves state-of-the-art zero-shot performance, establishing structured scene exploration as a promising strategy for advancing zero-shot 3DVG.
Problem

Research questions and friction points this paper is trying to address.

Zero-shot 3D visual grounding from language descriptions
Existing methods entangle visual cues, hindering spatial reasoning
Proposes structured scene graphs for interpretable, step-by-step reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Externalizes 3D spatial information into a structured scene graph
Uses VLM as active agent to selectively retrieve cues
Enables interpretable step-by-step reasoning for 3D visual grounding
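The agentic idea above can be illustrated with a minimal sketch. This is not the paper's implementation: all names (`Node`, `traverse`, the toy scene) are hypothetical, and the "VLM decision" is stubbed with a simple predicate and a fixed visiting order. It shows only the core loop: the agent walks the scene graph node by node, retrieves only the cues at the current node, and records an interpretable step-by-step trace.

```python
# Hedged sketch of agentic scene-graph traversal (all names hypothetical).
# A real system would let a VLM rank which neighbor to visit next and
# decide when the target is found; here both are simple stand-ins.
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                                      # object category, e.g. "chair"
    neighbors: list = field(default_factory=list)   # spatially related nodes

def traverse(root, is_target, max_steps=10):
    """Visit nodes one at a time, keeping a human-readable trace of labels.
    `is_target` stands in for the VLM's grounding decision."""
    trace, frontier, visited = [], [root], set()
    while frontier and len(trace) < max_steps:
        current = frontier.pop()        # stand-in: a VLM would pick the next node
        if id(current) in visited:
            continue
        visited.add(id(current))
        trace.append(current.label)     # the interpretable reasoning trace
        if is_target(current):
            return current, trace
        frontier.extend(n for n in current.neighbors
                        if id(n) not in visited)
    return None, trace

# Toy scene: a table spatially linked to a lamp and a chair.
table, lamp, chair = Node("table"), Node("lamp"), Node("chair")
table.neighbors = [lamp, chair]
lamp.neighbors = [table]
chair.neighbors = [table]

result, trace = traverse(table, lambda n: n.label == "chair")
```

The trace returned alongside the result is what makes this style of grounding transparent: each step names the node the agent inspected before committing to the target.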
Yuanyuan Liu
Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology
Haiyang Mei
National University of Singapore, Dalian University of Technology, ETH Zurich
Dongyang Zhan
Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology
Jiayue Zhao
Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology
Dongsheng Zhou
Dalian University
Bo Dong
Cephia AI
Xin Yang
Key Laboratory of Social Computing and Cognitive Intelligence, Dalian University of Technology