🤖 AI Summary
Weak generalization in 3D visual grounding (3DVG) stems from scarce 3D vision-language data and the limited 3D reasoning ability of vision-language models (VLMs). To address this, we propose a panoramic rendering-based cross-modal alignment framework. Our method introduces: (1) a panoramic intermediate representation that jointly encodes geometric and semantic information of 3D scenes; (2) a three-stage paradigm—view selection, single-view grounding, and 3D lifting—that balances long-range contextual modeling with plug-and-play integration of pre-trained VLMs; and (3) layout-aware view sampling and cross-view prediction fusion to enhance spatial consistency and robustness. Evaluated on ScanRefer and Nr3D, our approach achieves state-of-the-art performance, demonstrating significant improvements in generalization to unseen scenes and diverse textual descriptions while requiring no 3D-specific model retraining.
📝 Abstract
3D Visual Grounding (3DVG) is a critical bridge from vision-language perception to robotics, requiring both language understanding and 3D scene reasoning. Traditional supervised models leverage explicit 3D geometry but exhibit limited generalization, owing to the scarcity of 3D vision-language datasets and their limited reasoning capabilities compared to those of modern vision-language models (VLMs). We propose PanoGrounder, a generalizable 3DVG framework that couples a multi-modal panoramic representation with pretrained 2D VLMs for strong vision-language reasoning. Panoramic renderings, augmented with 3D semantic and geometric features, serve as an intermediate representation between 2D and 3D, and offer two major benefits: (i) they can be fed directly to VLMs with minimal adaptation, and (ii) they retain long-range object-to-object relations thanks to their 360-degree field of view. We devise a three-stage pipeline that places a compact set of panoramic viewpoints based on the scene layout and geometry, grounds a text query on each panoramic rendering with a VLM, and fuses per-view predictions into a single 3D bounding box via lifting. Our approach achieves state-of-the-art results on ScanRefer and Nr3D, and demonstrates superior generalization to unseen 3D datasets and text rephrasings.
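The three-stage pipeline can be sketched in toy form as follows. This is a minimal illustration, not the paper's implementation: `select_viewpoints`, `ground_in_view`, and `lift_and_fuse` are hypothetical names, the viewpoint sampler is a simple axis-aligned stand-in for layout-aware sampling, and the VLM grounder is replaced by a distance-based stub; the fusion step shows the general idea of confidence-weighted cross-view aggregation into one 3D box.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Candidate:
    center: Tuple[float, float, float]  # 3D box center (x, y, z)
    size: Tuple[float, float, float]    # 3D box extents

def select_viewpoints(extent, k=3):
    """Stage 1 (stand-in): spread k panoramic viewpoints along the scene's x axis.
    The paper's layout-aware sampling would use scene geometry instead."""
    (xmin, xmax), (ymin, ymax) = extent
    step = (xmax - xmin) / k
    return [(xmin + (i + 0.5) * step, (ymin + ymax) / 2) for i in range(k)]

def ground_in_view(viewpoint, candidates):
    """Stage 2 (stub): in place of a VLM call on the panoramic rendering,
    score each candidate by inverse distance to the viewpoint."""
    vx, vy = viewpoint
    best, conf = None, -1.0
    for c in candidates:
        d = math.hypot(c.center[0] - vx, c.center[1] - vy)
        score = 1.0 / (1.0 + d)
        if score > conf:
            best, conf = c, score
    return best, conf

def lift_and_fuse(per_view):
    """Stage 3: confidence-weighted fusion of per-view predictions
    into a single 3D bounding box."""
    total = sum(conf for _, conf in per_view)
    center = tuple(
        sum(c.center[i] * conf for c, conf in per_view) / total for i in range(3)
    )
    size = tuple(max(c.size[i] for c, _ in per_view) for i in range(3))
    return Candidate(center, size)

# Toy scene with two candidate objects (hypothetical data).
chair = Candidate((1.0, 1.0, 0.5), (0.6, 0.6, 1.0))
table = Candidate((4.0, 1.0, 0.4), (1.2, 0.8, 0.8))
views = select_viewpoints(((0.0, 5.0), (0.0, 2.0)), k=3)
per_view = [ground_in_view(v, [chair, table]) for v in views]
fused = lift_and_fuse(per_view)
```

Note the modularity this buys: because stage 2 operates on a single rendered view, the grounder stub can be swapped for any off-the-shelf VLM without touching the sampling or fusion stages, which is the plug-and-play property the abstract highlights.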