🤖 AI Summary
This work addresses zero-shot 3D vision-language grounding, i.e., localizing objects in 3D scenes from natural language queries, without requiring 3D annotations, predefined categories, or scene priors, thereby supporting downstream applications such as AR and robotics. Methodologically, it proposes SeeGround, the first zero-shot 3D grounding framework built on pre-trained 2D vision-language models (VLMs). The framework introduces two novel components: (i) query-adaptive viewpoint selection (the Perspective Adaptation Module), which identifies views best aligned with the query, and (ii) cross-modal fusion and alignment (the Fusion Alignment Module), which bridges the modality gap between 2D VLM semantics and 3D spatial reasoning. Fine-grained localization is achieved through multi-view rendering, spatially enriched textual encoding, and dynamic feature alignment. On ScanRefer and Nr3D, the method improves zero-shot grounding accuracy by +7.7% and +7.1% over prior zero-shot methods, respectively, and approaches fully supervised baselines, demonstrating strong generalization and practical viability.
📝 Abstract
3D Visual Grounding (3DVG) seeks to locate target objects in 3D scenes using natural language descriptions, enabling downstream applications such as augmented reality and robotics. Existing approaches typically rely on labeled 3D data and predefined categories, limiting scalability to open-world settings. We present SeeGround, a zero-shot 3DVG framework that leverages 2D Vision-Language Models (VLMs) to bypass the need for 3D-specific training. To bridge the modality gap, we introduce a hybrid input format that pairs query-aligned rendered views with spatially enriched textual descriptions. Our framework incorporates two core components: a Perspective Adaptation Module that dynamically selects optimal viewpoints based on the query, and a Fusion Alignment Module that integrates visual and spatial signals to enhance localization precision. Extensive evaluations on ScanRefer and Nr3D confirm that SeeGround achieves substantial improvements over existing zero-shot baselines -- outperforming them by 7.7% and 7.1%, respectively -- and even rivals fully supervised alternatives, demonstrating strong generalization under challenging conditions.
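To make the hybrid input format more concrete, here is a minimal sketch of two of its ingredients: a geometric heuristic that places a camera facing a query-relevant anchor object (a stand-in for query-aligned view selection), and a serializer that turns detected 3D boxes into a spatially enriched text prompt for a 2D VLM. The function names, the look-at heuristic, and the prompt format are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def select_viewpoint(anchor_center, scene_center, distance=2.0, height=1.5):
    # Hypothetical heuristic: position the camera on the far side of the
    # anchor object, along the ray from the scene centre, so the rendered
    # view faces the region the query refers to. The look-at target is
    # the anchor itself.
    direction = anchor_center - scene_center
    norm = np.linalg.norm(direction)
    if norm < 1e-6:
        # Degenerate case: anchor sits at the scene centre; pick a default axis.
        direction, norm = np.array([1.0, 0.0, 0.0]), 1.0
    offset = direction / norm
    eye = anchor_center + offset * distance
    eye[2] = anchor_center[2] + height  # raise the camera above the object
    return eye

def describe_objects(objects):
    # Serialize (label, centre) pairs into a simple spatial text description;
    # the real system would use richer relations (left of, behind, near).
    return "; ".join(
        f"{name} at ({x:.1f}, {y:.1f}, {z:.1f})" for name, (x, y, z) in objects
    )

# Example: camera placed to view a chair from outside the scene centre,
# plus a text prompt the 2D VLM could consume alongside the rendered view.
eye = select_viewpoint(np.array([2.0, 0.0, 0.0]), np.array([0.0, 0.0, 0.0]))
prompt = describe_objects([("chair", (2.0, 0.0, 0.0)), ("table", (0.0, 1.0, 0.4))])
```

In this sketch, rendering the scene from `eye` and pairing the image with `prompt` would form one hybrid input; the actual module additionally fuses and aligns the visual and spatial signals before localization.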