Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes UniScene3D, a unified, general-purpose 3D scene representation framework designed to support diverse downstream understanding tasks. UniScene3D is a Transformer-based encoder pretrained by aligning multi-view colored pointmaps with CLIP embeddings, so that it jointly models geometric structure and visual appearance. The method introduces two key components, cross-view geometric alignment and grounded view alignment, which enforce geometric and semantic consistency across viewpoints and make the learned representation more robust. In low-shot and task-specific fine-tuning evaluations, UniScene3D reports state-of-the-art performance on viewpoint grounding, scene retrieval, scene type classification, and 3D visual question answering.
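To make the term "colored pointmap" concrete, here is a minimal sketch. It assumes the common definition of a pointmap as a per-pixel map of 3D coordinates, built here by back-projecting a depth map through pinhole intrinsics and appending RGB. The function name `colored_pointmap`, the intrinsics `fx, fy, cx, cy`, and the 6-channel layout are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def colored_pointmap(depth, rgb, fx, fy, cx, cy):
    """Back-project a depth map to per-pixel XYZ and append RGB.

    depth: (H, W) depth along the camera z-axis.
    rgb:   (H, W, 3) colors in [0, 1].
    Returns an (H, W, 6) array: camera-frame XYZ followed by RGB.
    Hypothetical helper; the paper's actual input format may differ.
    """
    h, w = depth.shape
    # Pixel grids: u indexes columns, v indexes rows, both of shape (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    xyz = np.stack([x, y, depth], axis=-1)
    return np.concatenate([xyz, rgb], axis=-1)

# Toy view: a flat wall 2 units away, black image, principal point at (3, 2).
depth = np.full((4, 6), 2.0)
rgb = np.zeros((4, 6, 3))
pm = colored_pointmap(depth, rgb, fx=100.0, fy=100.0, cx=3.0, cy=2.0)
```

A multi-view input would stack several such maps, typically after transforming each view's XYZ into a shared world frame with its camera pose.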
📝 Abstract
Pretraining 3D encoders by aligning with Contrastive Language-Image Pretraining (CLIP) has emerged as a promising direction to learn generalizable representations for 3D scene understanding. In this paper, we propose UniScene3D, a transformer-based encoder that learns unified scene representations from multi-view colored pointmaps, jointly modeling image appearance and geometry. For robust colored pointmap representation learning, we introduce novel cross-view geometric alignment and grounded view alignment to enforce cross-view geometric and semantic consistency. Extensive low-shot and task-specific fine-tuning evaluations on viewpoint grounding, scene retrieval, scene type classification, and 3D VQA demonstrate state-of-the-art performance. These results highlight the effectiveness of our approach for unified 3D scene understanding. Project page: https://yebulabula.github.io/UniScene3D/
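The abstract does not spell out the alignment objective. As an illustration only, a common choice for CLIP-style alignment is a symmetric InfoNCE loss between the encoder's scene embeddings and frozen CLIP embeddings of the matching images or captions. The sketch below assumes that form; `clip_style_loss`, the temperature value, and the random stand-in embeddings are hypothetical and not taken from the paper.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(scene_emb, target_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    scene_emb:  (B, D) outputs of the 3D encoder (assumed).
    target_emb: (B, D) CLIP embeddings paired row by row (assumed).
    """
    s = scene_emb / np.linalg.norm(scene_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature          # (B, B) scaled cosine similarities
    idx = np.arange(len(s))                 # positives sit on the diagonal
    loss_s2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2s = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_s2t + loss_t2s)

rng = np.random.default_rng(0)
target = rng.normal(size=(8, 16))           # stand-in for CLIP embeddings
aligned = clip_style_loss(target + 0.01 * rng.normal(size=(8, 16)), target)
shuffled = clip_style_loss(target[::-1], target)   # every pair mismatched
```

With correctly paired rows the loss is close to zero, while reversing the batch order mismatches every pair and drives the loss up, which is the signal that pulls the 3D encoder toward the CLIP embedding space.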
Problem

Research questions and friction points this paper is trying to address.

3D scene understanding
pretraining
unified representation
colored pointmap
contrastive learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

contrastive pretraining
colored pointmap
cross-view alignment
unified 3D representation
CLIP-aligned 3D encoder