Zero-Shot Visual Grounding in 3D Gaussians via View Retrieval

📅 2025-09-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing 3D visual grounding (3DVG) methods in 3D Gaussian Splatting (3DGS) scenes suffer from scene-specific training, heavy reliance on dense 3D annotations, and poor generalization. Method: We propose GVR, the first zero-shot, text-driven 3D object localization framework for 3DGS. GVR reformulates 3D grounding as a multi-view 2D object-level cross-modal retrieval task, leveraging text-image matching to derive localization cues and introducing a multi-view consistency fusion mechanism to enhance robustness. Crucially, it requires no 3D annotations or scene-specific training and is fully decoupled from the 3D reconstruction pipeline. Results: GVR achieves state-of-the-art performance on ScanRefer and SR3D benchmarks, significantly reducing annotation and training overhead while enabling real-time, open-vocabulary object localization in real-world robotic scenarios.
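The fusion step described above can be sketched as follows. This is a minimal illustration of multi-view object-level retrieval with similarity-weighted fusion, not the paper's actual implementation: the embedding model, the per-view object structures (`object_embs`, `object_centers_3d`), and the fusion-by-weighted-average rule are all assumptions made for the example.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between a text embedding and an image-crop embedding.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ground_by_view_retrieval(text_emb, views, score_thresh=0.25):
    """Hypothetical sketch of grounding via view retrieval.

    views: list of dicts, one per rendered view, each with
      'object_embs'       -- (N, D) embeddings of object crops in that view
      'object_centers_3d' -- (N, 3) back-projected 3D centers of those objects
    In each view, retrieve the object crop best matching the text, then fuse
    the retrieved 3D centers across views, weighted by retrieval similarity.
    """
    centers, weights = [], []
    for v in views:
        sims = [cosine(text_emb, e) for e in v["object_embs"]]
        best = int(np.argmax(sims))
        if sims[best] >= score_thresh:  # keep only confident matches
            centers.append(v["object_centers_3d"][best])
            weights.append(sims[best])
    if not centers:
        return None  # text matched no object in any view
    return np.average(np.asarray(centers), axis=0, weights=np.asarray(weights))
```

In a real pipeline the embeddings would come from a pretrained text-image model (e.g. CLIP) applied to object crops in views rendered from the 3DGS scene, which is what makes the approach zero-shot: no 3D annotations or per-scene training are involved.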

📝 Abstract
3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on text prompts, which is essential for applications such as robotics. However, existing 3DVG methods face two main challenges: first, they struggle to handle the implicit representation of spatial textures in 3D Gaussian Splatting (3DGS), making per-scene training indispensable; second, they typically require large amounts of labeled data for effective training. To this end, we propose Grounding via View Retrieval (GVR), a novel zero-shot visual grounding framework for 3DGS that transforms 3DVG into a 2D retrieval task, leveraging object-level view retrieval to collect grounding clues from multiple views. This not only avoids the costly process of 3D annotation, but also eliminates the need for per-scene training. Extensive experiments demonstrate that our method achieves state-of-the-art visual grounding performance while avoiding per-scene training, providing a solid foundation for zero-shot 3DVG research. Video demos can be found at https://github.com/leviome/GVR_demos.
Problem

Research questions and friction points this paper is trying to address.

Locating objects in 3D scenes using text prompts
Handling implicit spatial textures in 3D Gaussian Splatting
Eliminating need for per-scene training and 3D annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot grounding via view retrieval
Transforms 3DVG into 2D retrieval task
Eliminates per-scene training and 3D annotation
Liwei Liao
Peking University, Peng Cheng Laboratory
Xufeng Li
City University of Hong Kong
Xiaoyun Zheng
Peking University, Peng Cheng Laboratory
Boning Liu
Peng Cheng Laboratory
Feng Gao
Peking University
Ronggang Wang
Shenzhen Graduate School, Peking University
Immersive Video Coding and Processing