🤖 AI Summary
Existing 3D visual grounding (3DVG) methods rely on separately pretrained vision and language encoders, leading to cross-modal misalignment at both the spatial-geometric and semantic-category levels, which causes localization inaccuracies and classification errors. To address this, we propose the first geometry-semantic joint alignment framework for 3DVG. Our approach introduces a unified representation encoder that adapts CLIP to point clouds, and employs multimodal contrastive learning for fine-grained cross-modal alignment. Additionally, we design a language-guided spatial candidate point selection mechanism. Crucially, our method is the first to simultaneously align the geometric structure and semantic categories of vision and language features in 3D scenes. Extensive experiments demonstrate state-of-the-art performance on ScanRefer, Nr3D, and Sr3D, with mean accuracy improvements of at least 2.24% over prior work, empirically validating the gains that joint geometric-semantic alignment brings to both localization precision and semantic matching.
📝 Abstract
3D visual grounding (3DVG) is a critical task in scene understanding that aims to identify objects in 3D scenes from text descriptions. However, existing methods rely on separately pre-trained vision and text encoders, leaving a significant gap between the two modalities in terms of spatial geometry and semantic categories. This discrepancy often causes errors in object positioning and classification. This paper proposes UniSpace-3D, which introduces a unified representation space for 3DVG that effectively bridges the gap between visual and textual features. Specifically, UniSpace-3D incorporates three innovative designs: i) a unified representation encoder that leverages the pre-trained CLIP model to map visual and textual features into a shared representation space; ii) a multi-modal contrastive learning module that further reduces the modality gap; iii) a language-guided query selection module that utilizes positional and semantic information to identify object candidate points aligned with the textual description. Extensive experiments demonstrate that UniSpace-3D outperforms baseline models by at least 2.24% on the ScanRefer and Nr3D/Sr3D datasets. The code will be made available upon acceptance of the paper.
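The abstract does not give implementation details for the multi-modal contrastive learning module, but modules of this kind are typically trained with a symmetric InfoNCE-style objective over batches of matched visual and text embeddings. The sketch below is an illustrative NumPy implementation under that assumption; the function names, feature shapes, and temperature value are hypothetical and not taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project feature vectors onto the unit hypersphere."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def symmetric_info_nce(vis_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE loss over paired visual and text embeddings.

    vis_feats, txt_feats: (N, D) arrays; row i of each is a matched
    (point-cloud feature, text feature) pair. Returns a scalar: the
    average of the vision-to-text and text-to-vision cross-entropies
    over the N x N cosine-similarity matrix, whose diagonal holds the
    positive pairs.
    """
    v = l2_normalize(vis_feats)
    t = l2_normalize(txt_feats)
    logits = v @ t.T / temperature  # (N, N) scaled cosine similarities

    def log_softmax(z, axis):
        z = z - z.max(axis=axis, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

    n = logits.shape[0]
    idx = np.arange(n)
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)
```

Minimizing this loss pulls each matched vision-text pair together while pushing apart mismatched pairs in the batch, which is one standard way to shrink the modality gap the paper describes.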