UniGround: Universal 3D Visual Grounding via Training-Free Scene Parsing

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited generalization and poor robustness of existing 3D visual grounding methods on out-of-distribution scenes, which often stem from reliance on pre-trained models. To overcome this, the authors propose a training-free, two-stage approach: first performing global candidate filtering using 3D topology and multi-view semantic encoding, then local precision grounding via multi-scale visual prompts and structured reasoning. This method achieves open-world 3D visual grounding without requiring any 3D supervision, reaching beyond the boundaries of pre-trained knowledge. It sets a new state of the art under zero-shot settings, attaining 46.1%/34.1% Acc@0.25/0.5 on ScanRefer and 28.7% Acc@0.25 on EmbodiedScan, while demonstrating strong robustness in real-world, uncontrolled environments.

📝 Abstract
Understanding and localizing objects in complex 3D environments from natural language descriptions, known as 3D Visual Grounding (3DVG), is a foundational challenge in embodied AI, with broad implications for robotics, augmented reality, and human-machine interaction. Large-scale pre-trained foundation models have driven significant progress on this front, enabling open-vocabulary 3DVG that allows systems to locate arbitrary objects in a given scene. However, their reliance on pre-trained models constrains 3D perception and reasoning within the inherited knowledge boundaries, resulting in limited generalization to unseen spatial relationships and poor robustness to out-of-distribution scenes. In this paper, we replace this constrained perception with training-free visual and geometric reasoning, thereby unlocking open-world 3DVG that enables the localization of any object in any scene beyond the training data. Specifically, the proposed UniGround operates in two stages: a Global Candidate Filtering stage that constructs scene candidates through training-free 3D topology and multi-view semantic encoding, and a Local Precision Grounding stage that leverages multi-scale visual prompting and structured reasoning to precisely identify the target object. Experiments on ScanRefer and EmbodiedScan show that UniGround achieves 46.1%/34.1% Acc@0.25/0.5 on ScanRefer and 28.7% Acc@0.25 on EmbodiedScan, establishing a new state-of-the-art among zero-shot methods on EmbodiedScan without any 3D supervision. We further evaluate UniGround in real-world environments under uncontrolled reconstruction conditions and substantial domain shift, showing that training-free reasoning generalizes robustly beyond curated benchmarks.
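The two-stage pipeline in the abstract can be sketched schematically. This is a minimal toy illustration, not the paper's implementation: all names (`Candidate`, `global_candidate_filtering`, `local_precision_grounding`) and the scoring heuristic are assumptions invented for this sketch; the actual method uses 3D topology parsing, multi-view semantic encoding, and VLM-based structured reasoning over multi-scale visual prompts.

```python
# Toy sketch of a two-stage grounding pipeline (names and logic are
# illustrative assumptions, NOT UniGround's actual API or algorithm).
from dataclasses import dataclass

@dataclass
class Candidate:
    object_id: int
    label: str          # open-vocabulary label (paper: multi-view semantic encoding)
    center: tuple       # 3D centroid (paper: training-free 3D topology parsing)
    score: float = 0.0  # match score against the language query

def global_candidate_filtering(scene, query):
    """Stage 1 stand-in: keep objects whose label is mentioned in the query."""
    return [c for c in scene if c.label in query]

def local_precision_grounding(candidates, query):
    """Stage 2 stand-in: rank candidates with a toy spatial heuristic.
    The paper instead uses multi-scale visual prompts and structured reasoning."""
    for c in candidates:
        # Prefer lower objects (smaller z) when the query mentions the floor.
        c.score = -c.center[2] if "floor" in query else 1.0
    return max(candidates, key=lambda c: c.score)

scene = [
    Candidate(0, "chair", (1.0, 2.0, 0.4)),
    Candidate(1, "chair", (3.0, 1.0, 1.2)),
    Candidate(2, "table", (2.0, 2.0, 0.7)),
]
query = "the chair closest to the floor"
target = local_precision_grounding(global_candidate_filtering(scene, query), query)
print(target.object_id)  # → 0
```

The design point the sketch preserves is the division of labor: a cheap global pass prunes the scene to semantically plausible candidates, so the expensive fine-grained reasoning in stage two only runs on a short list.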
Problem

Research questions and friction points this paper is trying to address.

3D Visual Grounding
open-world perception
zero-shot generalization
out-of-distribution robustness
embodied AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
3D visual grounding
open-world perception
scene parsing
zero-shot
Jiaxi Zhang
Peking University
Electronic Design Automation

Yunheng Wang
The Hong Kong University of Science and Technology (Guangzhou)

Wei Lu
Shanghai Normal University

Taowen Wang
The Hong Kong University of Science and Technology (Guangzhou)

Weisheng Xu
The Hong Kong University of Science and Technology (Guangzhou)

Shuning Zhang
Tsinghua University
HCI · Usable Privacy and Security · AI

Yixiao Feng
The Hong Kong University of Science and Technology (Guangzhou)

Yuetong Fang
Ph.D. Student, HKUST(GZ)
Brain-inspired Computing · Neuromorphic Computing · Embodied AI

Renjing Xu
HKUST(GZ)
Brain-inspired Computing · Humanoid Computing