🤖 AI Summary
This work addresses zero-shot 3D vision-language grounding, i.e., localizing objects in 3D scenes from natural language queries, without requiring 3D annotations, predefined categories, or scene priors, thereby supporting downstream applications such as AR and robotics. Methodologically, it proposes SeeGround, the first zero-shot 3D grounding framework built on pre-trained 2D vision-language models (VLMs). The framework introduces two novel components: (i) query-adaptive viewpoint selection (the Perspective Adaptation Module), which identifies views best aligned with the query, and (ii) cross-modal fusion and alignment (the Fusion Alignment Module), which bridges the modality gap between 2D VLM semantics and 3D spatial reasoning. Fine-grained localization is achieved through multi-view rendering, spatially enriched textual encoding, and dynamic feature alignment. On ScanRefer and Nr3D, the method improves zero-shot grounding accuracy by +7.7% and +7.1% over prior zero-shot methods, respectively, and approaches fully supervised baselines, demonstrating strong generalization and practical viability.
📝 Abstract
3D Visual Grounding (3DVG) seeks to locate target objects in 3D scenes using natural language descriptions, enabling downstream applications such as augmented reality and robotics. Existing approaches typically rely on labeled 3D data and predefined categories, limiting scalability to open-world settings. We present SeeGround, a zero-shot 3DVG framework that leverages 2D Vision-Language Models (VLMs) to bypass the need for 3D-specific training. To bridge the modality gap, we introduce a hybrid input format that pairs query-aligned rendered views with spatially enriched textual descriptions. Our framework incorporates two core components: a Perspective Adaptation Module that dynamically selects optimal viewpoints based on the query, and a Fusion Alignment Module that integrates visual and spatial signals to enhance localization precision. Extensive evaluations on ScanRefer and Nr3D confirm that SeeGround achieves substantial improvements over existing zero-shot baselines -- outperforming them by 7.7% and 7.1%, respectively -- and even rivals fully supervised alternatives, demonstrating strong generalization under challenging conditions.
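To make the hybrid input format more concrete, here is a minimal sketch of two of its ingredients: a geometric heuristic that places a camera facing a query-relevant anchor object (a stand-in for query-aligned view selection), and a serializer that turns detected 3D boxes into a spatially enriched text prompt for a 2D VLM. The function names, the look-at heuristic, and the prompt format are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def select_viewpoint(anchor_center, scene_center, distance=2.0, height=1.5):
    # Hypothetical heuristic: position the camera on the far side of the
    # anchor object, along the ray from the scene centre, so the rendered
    # view faces the region the query refers to. The look-at target is
    # the anchor itself.
    direction = anchor_center - scene_center
    norm = np.linalg.norm(direction)
    if norm < 1e-6:
        # Degenerate case: anchor sits at the scene centre; pick a default axis.
        direction, norm = np.array([1.0, 0.0, 0.0]), 1.0
    offset = direction / norm
    eye = anchor_center + offset * distance
    eye[2] = anchor_center[2] + height  # raise the camera above the object
    return eye

def describe_objects(objects):
    # Serialize (label, centre) pairs into a simple spatial text description;
    # the real system would use richer relations (left of, behind, near).
    return "; ".join(
        f"{name} at ({x:.1f}, {y:.1f}, {z:.1f})" for name, (x, y, z) in objects
    )

# Example: camera placed to view a chair from outside the scene centre,
# plus a text prompt the 2D VLM could consume alongside the rendered view.
eye = select_viewpoint(np.array([2.0, 0.0, 0.0]), np.array([0.0, 0.0, 0.0]))
prompt = describe_objects([("chair", (2.0, 0.0, 0.0)), ("table", (0.0, 1.0, 0.4))])
```

In this sketch, rendering the scene from `eye` and pairing the image with `prompt` would form one hybrid input; the actual module additionally fuses and aligns the visual and spatial signals before localization.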