🤖 AI Summary
Problem: Existing 3D visual grounding (3DVG) methods in 3D Gaussian Splatting (3DGS) scenes suffer from scene-specific training, heavy reliance on dense 3D annotations, and poor generalization. Method: We propose GVR, the first zero-shot, text-driven 3D object localization framework for 3DGS. GVR reformulates 3D grounding as a multi-view 2D object-level cross-modal retrieval task, leveraging text-image matching to derive localization cues and introducing a multi-view consistency fusion mechanism to enhance robustness. Crucially, it requires no 3D annotations or scene-specific training and is fully decoupled from the 3D reconstruction pipeline. Results: GVR achieves state-of-the-art performance on the ScanRefer and SR3D benchmarks, significantly reducing annotation and training overhead while enabling real-time, open-vocabulary object localization in real-world robotic scenarios.
📝 Abstract
3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on text prompts, which is essential for applications such as robotics. However, existing 3DVG methods encounter two main challenges: first, they struggle to handle the implicit representation of spatial textures in 3D Gaussian Splatting (3DGS), making per-scene training indispensable; second, they typically require large amounts of labeled data for effective training. To this end, we propose Grounding via View Retrieval (GVR), a novel zero-shot visual grounding framework for 3DGS that recasts 3DVG as a 2D retrieval task. GVR leverages object-level view retrieval to collect grounding clues from multiple views, which not only avoids the costly process of 3D annotation but also eliminates the need for per-scene training. Extensive experiments demonstrate that our method achieves state-of-the-art visual grounding performance while avoiding per-scene training, providing a solid foundation for zero-shot 3DVG research. Video demos can be found at https://github.com/leviome/GVR_demos.
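The retrieval-then-fusion idea described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes that each rendered view has already been segmented into objects, that every object comes with an embedding from some text-image model and a 3D center obtained by back-projection, and that fusion is a simple score-weighted vote. All names (`best_object_per_view`, `fuse_picks`, `radius`) are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vec = Sequence[float]
Point = Tuple[float, float, float]
ViewObject = Tuple[Vec, Point]  # (object embedding, assumed back-projected 3D center)


def cosine(a: Vec, b: Vec) -> float:
    """Cosine similarity between two embeddings (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def best_object_per_view(
    text_emb: Vec, views: List[List[ViewObject]]
) -> List[Tuple[float, Point]]:
    """For each view, retrieve the object whose embedding best matches the text."""
    picks = []
    for objects in views:
        if not objects:
            continue  # nothing was detected in this view
        scored = [(cosine(text_emb, emb), center) for emb, center in objects]
        picks.append(max(scored, key=lambda t: t[0]))
    return picks


def fuse_picks(
    picks: List[Tuple[float, Point]], radius: float = 0.5
) -> Optional[Point]:
    """Hypothetical fusion rule: find the group of per-view picks (within
    `radius` of one pick) with the highest total score, and return the
    score-weighted mean of that group's 3D centers."""
    positive = [(s, c) for s, c in picks if s > 0]
    best_support, best_center = 0.0, None
    for _, anchor in positive:
        group = [(s, c) for s, c in positive if math.dist(anchor, c) <= radius]
        support = sum(s for s, _ in group)
        if support > best_support:
            best_support = support
            best_center = tuple(
                sum(s * c[i] for s, c in group) / support for i in range(3)
            )
    return best_center


# Toy example: two views agree on an object near (1, 1, 1); a third view
# retrieves a weakly matching outlier, which the vote ignores.
text_emb = [1.0, 0.0]
views = [
    [([1.0, 0.0], (1.0, 1.0, 1.0)), ([0.0, 1.0], (5.0, 5.0, 5.0))],
    [([0.9, 0.1], (1.1, 1.0, 1.0)), ([0.0, 1.0], (5.0, 5.0, 5.0))],
    [([0.0, 1.0], (5.0, 5.0, 5.0)), ([0.1, 1.0], (9.0, 9.0, 9.0))],
]
picks = best_object_per_view(text_emb, views)
center = fuse_picks(picks)
```

In this toy setup, no 3D labels or per-scene training are involved: all supervision comes from the 2D text-image similarity, and agreement across views is what suppresses the single-view outlier.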