🤖 AI Summary
Zero-shot 3D visual grounding (3DVG) suffers from a disconnect between spatial reasoning and semantic understanding, and relies heavily on scarce annotated 3D data. Method: We propose a spatial-semantic progressive reasoning framework that operates fully zero-shot, requiring no 3D fine-tuning, by leveraging large vision-language models (VLMs). Our approach integrates optimal-view 3D rendering, anchor-guided candidate filtering, and a joint 3D–2D decision mechanism for end-to-end localization. Contribution/Results: Crucially, it unifies 3D geometric structure modeling with 2D semantic perception, overcoming the limitations of unimodal understanding. On the ScanRefer and Nr3D benchmarks, our method improves grounding accuracy by absolute margins of +9.0% and +10.9% over prior zero-shot approaches, establishing a new paradigm for language-driven 3D localization in unlabeled scenes.
📝 Abstract
3D Visual Grounding (3DVG) aims to localize target objects within a 3D scene based on natural language queries. To alleviate the reliance on costly 3D training data, recent studies have explored zero-shot 3DVG by leveraging the extensive knowledge and powerful reasoning capabilities of pre-trained LLMs and VLMs. However, existing paradigms tend to emphasize either spatial (3D-based) or semantic (2D-based) understanding, limiting their effectiveness in complex real-world applications. In this work, we introduce SPAZER, a VLM-driven agent that combines both modalities in a progressive reasoning framework. It first analyzes the scene holistically and produces a 3D rendering from the optimal viewpoint. Based on this view, anchor-guided candidate screening performs a coarse-level localization of potential objects. Finally, leveraging retrieved relevant 2D camera images, a joint 3D–2D decision efficiently determines the best-matching object. By bridging the spatial and semantic reasoning streams, SPAZER achieves robust zero-shot grounding without training on 3D-labeled data. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that SPAZER significantly outperforms previous state-of-the-art zero-shot methods, with notable accuracy gains of 9.0% and 10.9%.
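The three-stage flow in the abstract (optimal-view selection, anchor-guided screening, joint 3D–2D decision) can be pictured as a small scoring pipeline. The sketch below is purely illustrative: all function names, the `Candidate` fields, and the fixed fusion weight are assumptions, and the real agent obtains these scores by querying a VLM rather than from precomputed numbers.

```python
# Illustrative sketch of the progressive reasoning pipeline; names and
# scores are hypothetical stand-ins for VLM queries in the actual method.
from dataclasses import dataclass


@dataclass
class Candidate:
    obj_id: int
    label: str
    spatial_score: float   # evidence from 3D rendered-view reasoning
    semantic_score: float  # evidence from 2D camera-image reasoning


def select_best_view(views):
    # Stage 1: pick the rendering viewpoint that best covers the scene.
    return max(views, key=lambda v: v["coverage"])


def screen_candidates(candidates, anchor_label, top_k=2):
    # Stage 2: anchor-guided coarse screening — keep candidates whose
    # category matches the anchor, ranked by 3D spatial evidence.
    matched = [c for c in candidates if c.label == anchor_label]
    return sorted(matched, key=lambda c: c.spatial_score, reverse=True)[:top_k]


def joint_decision(candidates, w_3d=0.5):
    # Stage 3: fuse 3D spatial and 2D semantic evidence into one score.
    return max(
        candidates,
        key=lambda c: w_3d * c.spatial_score + (1 - w_3d) * c.semantic_score,
    )
```

In this toy setup, a candidate that ranks first on 3D spatial evidence alone can be overtaken once 2D semantic evidence is fused in, which is the kind of correction the joint 3D–2D decision stage is meant to provide.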