🤖 AI Summary
Existing 3D visual grounding (3DVG) methods rely on separately pretrained vision and language encoders, leading to cross-modal misalignment at both the spatial-geometric and semantic-category levels, which causes localization inaccuracies and classification errors. To address this, we propose the first geometry-semantic joint alignment framework for 3DVG. Our approach introduces a unified representation encoder that adapts CLIP to point clouds, and employs multimodal contrastive learning for fine-grained cross-modal alignment. Additionally, we design a language-guided spatial candidate point selection mechanism. Crucially, our method is the first to simultaneously align the geometric structure and semantic categories of vision and language features in 3D scenes. Extensive experiments demonstrate state-of-the-art performance on ScanRefer, Nr3D, and Sr3D, with mean accuracy improvements of at least 2.24% over prior work, empirically validating the gains that joint geometric-semantic alignment brings to both localization precision and semantic matching.
📝 Abstract
3D visual grounding (3DVG) is a critical task in scene understanding that aims to identify objects in 3D scenes from text descriptions. However, existing methods rely on separately pre-trained vision and text encoders, leaving a significant gap between the two modalities in terms of spatial geometry and semantic categories. This discrepancy often causes errors in object positioning and classification. This paper proposes UniSpace-3D, which introduces a unified representation space for 3DVG that effectively bridges the gap between visual and textual features. Specifically, UniSpace-3D incorporates three innovative designs: i) a unified representation encoder that leverages the pre-trained CLIP model to map visual and textual features into a shared representation space; ii) a multi-modal contrastive learning module that further reduces the modality gap; iii) a language-guided query selection module that utilizes positional and semantic information to identify object candidate points aligned with the textual description. Extensive experiments demonstrate that UniSpace-3D outperforms baseline models by at least 2.24% on the ScanRefer and Nr3D/Sr3D datasets. The code will be made available upon acceptance of the paper.
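The abstract does not give implementation details for the multi-modal contrastive learning module, but modules of this kind are typically trained with a symmetric InfoNCE-style objective over batches of matched visual and text embeddings. The sketch below is an illustrative NumPy implementation under that assumption; the function names, feature shapes, and temperature value are hypothetical and not taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project feature vectors onto the unit hypersphere."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def symmetric_info_nce(vis_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE loss over paired visual and text embeddings.

    vis_feats, txt_feats: (N, D) arrays; row i of each is a matched
    (point-cloud feature, text feature) pair. Returns a scalar: the
    average of the vision-to-text and text-to-vision cross-entropies
    over the N x N cosine-similarity matrix, whose diagonal holds the
    positive pairs.
    """
    v = l2_normalize(vis_feats)
    t = l2_normalize(txt_feats)
    logits = v @ t.T / temperature  # (N, N) scaled cosine similarities

    def log_softmax(z, axis):
        z = z - z.max(axis=axis, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

    n = logits.shape[0]
    idx = np.arange(n)
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)
```

Minimizing this loss pulls each matched vision-text pair together while pushing apart mismatched pairs in the batch, which is one standard way to shrink the modality gap the paper describes.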