🤖 AI Summary
Weak generalization in 3D visual grounding (3DVG) stems from scarce 3D vision-language data and the limited 3D reasoning ability of vision-language models (VLMs). To address this, we propose a panoramic rendering-based cross-modal alignment framework. Our method introduces: (1) a panoramic intermediate representation that jointly encodes geometric and semantic information of 3D scenes; (2) a three-stage paradigm—view selection, single-view grounding, and 3D lifting—that balances long-range contextual modeling with plug-and-play integration of pre-trained VLMs; and (3) layout-aware view sampling and cross-view prediction fusion to enhance spatial consistency and robustness. Evaluated on ScanRefer and Nr3D, our approach achieves state-of-the-art performance, demonstrating significant improvements in generalization to unseen scenes and diverse textual descriptions while requiring no 3D-specific model retraining.
📝 Abstract
3D Visual Grounding (3DVG) is a critical bridge from vision-language perception to robotics, requiring both language understanding and 3D scene reasoning. Traditional supervised models leverage explicit 3D geometry but exhibit limited generalization, owing to the scarcity of 3D vision-language datasets and their limited reasoning capabilities compared to those of modern vision-language models (VLMs). We propose PanoGrounder, a generalizable 3DVG framework that couples a multi-modal panoramic representation with pretrained 2D VLMs for strong vision-language reasoning. Panoramic renderings, augmented with 3D semantic and geometric features, serve as an intermediate representation between 2D and 3D, and offer two major benefits: (i) they can be fed directly to VLMs with minimal adaptation, and (ii) they retain long-range object-to-object relations thanks to their 360-degree field of view. We devise a three-stage pipeline that places a compact set of panoramic viewpoints based on the scene layout and geometry, grounds a text query on each panoramic rendering with a VLM, and fuses per-view predictions into a single 3D bounding box via lifting. Our approach achieves state-of-the-art results on ScanRefer and Nr3D, and demonstrates superior generalization to unseen 3D datasets and text rephrasings.
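The three-stage pipeline can be sketched in toy form as follows. This is a minimal illustration, not the paper's implementation: `select_viewpoints`, `ground_in_view`, and `lift_and_fuse` are hypothetical names, the viewpoint sampler is a simple axis-aligned stand-in for layout-aware sampling, and the VLM grounder is replaced by a distance-based stub; the fusion step shows the general idea of confidence-weighted cross-view aggregation into one 3D box.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Candidate:
    center: Tuple[float, float, float]  # 3D box center (x, y, z)
    size: Tuple[float, float, float]    # 3D box extents

def select_viewpoints(extent, k=3):
    """Stage 1 (stand-in): spread k panoramic viewpoints along the scene's x axis.
    The paper's layout-aware sampling would use scene geometry instead."""
    (xmin, xmax), (ymin, ymax) = extent
    step = (xmax - xmin) / k
    return [(xmin + (i + 0.5) * step, (ymin + ymax) / 2) for i in range(k)]

def ground_in_view(viewpoint, candidates):
    """Stage 2 (stub): in place of a VLM call on the panoramic rendering,
    score each candidate by inverse distance to the viewpoint."""
    vx, vy = viewpoint
    best, conf = None, -1.0
    for c in candidates:
        d = math.hypot(c.center[0] - vx, c.center[1] - vy)
        score = 1.0 / (1.0 + d)
        if score > conf:
            best, conf = c, score
    return best, conf

def lift_and_fuse(per_view):
    """Stage 3: confidence-weighted fusion of per-view predictions
    into a single 3D bounding box."""
    total = sum(conf for _, conf in per_view)
    center = tuple(
        sum(c.center[i] * conf for c, conf in per_view) / total for i in range(3)
    )
    size = tuple(max(c.size[i] for c, _ in per_view) for i in range(3))
    return Candidate(center, size)

# Toy scene with two candidate objects (hypothetical data).
chair = Candidate((1.0, 1.0, 0.5), (0.6, 0.6, 1.0))
table = Candidate((4.0, 1.0, 0.4), (1.2, 0.8, 0.8))
views = select_viewpoints(((0.0, 5.0), (0.0, 2.0)), k=3)
per_view = [ground_in_view(v, [chair, table]) for v in views]
fused = lift_and_fuse(per_view)
```

Note the modularity this buys: because stage 2 operates on a single rendered view, the grounder stub can be swapped for any off-the-shelf VLM without touching the sampling or fusion stages, which is the plug-and-play property the abstract highlights.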