Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding

📅 2026-04-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes UniScene3D, a transformer-based encoder that learns a unified 3D scene representation for diverse downstream understanding tasks. UniScene3D is pretrained by aligning multi-view colored pointmaps with CLIP embeddings, jointly modeling geometry and image appearance. The method introduces two alignment objectives, cross-view geometric alignment and grounded view alignment, which enforce geometric and semantic consistency across viewpoints and make the learned representation more robust. In low-shot and task-specific fine-tuning evaluations, UniScene3D reports state-of-the-art performance on viewpoint grounding, scene retrieval, scene type classification, and 3D visual question answering (VQA).
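To make the term "colored pointmap" concrete, the sketch below builds one from a depth image and an RGB image: each pixel stores a 3D point alongside its color. This is only an illustration under stated assumptions. The pinhole camera model, the 6-channel XYZ+RGB layout, and the names `colored_pointmap`, `fx`, `fy`, `cx`, `cy` are hypothetical and are not taken from the paper, which may construct its inputs differently.

```python
import numpy as np

def colored_pointmap(depth, rgb, fx, fy, cx, cy):
    """Back-project a depth image to camera-frame XYZ and append RGB.

    Assumed layout (not from the paper): an (H, W, 6) array whose
    channels 0-2 are XYZ and channels 3-5 are RGB.
    depth: (H, W) metric depth; rgb: (H, W, 3) colors in [0, 1].
    """
    h, w = depth.shape
    # u holds pixel columns, v holds pixel rows; both have shape (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Pinhole back-projection with assumed intrinsics fx, fy, cx, cy.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    xyz = np.stack([x, y, depth], axis=-1)
    return np.concatenate([xyz, rgb], axis=-1)

# Toy input: a flat wall 2 m from the camera, all-black colors.
depth = np.full((4, 6), 2.0)
rgb = np.zeros((4, 6, 3))
pm = colored_pointmap(depth, rgb, fx=100.0, fy=100.0, cx=3.0, cy=2.0)
```

In this layout the geometry and appearance of a view share one image-shaped grid, so several views of a scene can be stacked and fed to an image-style encoder.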

📝 Abstract

Pretraining 3D encoders by aligning with Contrastive Language-Image Pretraining (CLIP) has emerged as a promising direction to learn generalizable representations for 3D scene understanding. In this paper, we propose UniScene3D, a transformer-based encoder that learns unified scene representations from multi-view colored pointmaps, jointly modeling image appearance and geometry. For robust colored pointmap representation learning, we introduce novel cross-view geometric alignment and grounded view alignment to enforce cross-view geometric and semantic consistency. Extensive low-shot and task-specific fine-tuning evaluations on viewpoint grounding, scene retrieval, scene type classification, and 3D VQA demonstrate state-of-the-art performance. These results highlight the effectiveness of our approach for unified 3D scene understanding. Project page: https://yebulabula.github.io/UniScene3D/
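The abstract does not spell out the training objective, so the following is only a generic sketch of CLIP-style contrastive alignment, not the authors' loss. It assumes a batch of scene embeddings from a pointmap encoder and a matching batch of CLIP embeddings, and applies a symmetric InfoNCE loss in which row i of each batch forms the positive pair. The function name `symmetric_info_nce` and the temperature value 0.07 are assumptions chosen for illustration.

```python
import numpy as np

def _log_softmax(logits, axis):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(scene_emb, clip_emb, temperature=0.07):
    """Generic CLIP-style loss; row i of each input is a positive pair.

    scene_emb, clip_emb: (B, D) arrays. The temperature is an assumed value.
    """
    # L2-normalize so the dot product is cosine similarity.
    s = scene_emb / np.linalg.norm(scene_emb, axis=1, keepdims=True)
    c = clip_emb / np.linalg.norm(clip_emb, axis=1, keepdims=True)
    logits = s @ c.T / temperature  # (B, B) similarity matrix
    idx = np.arange(logits.shape[0])
    # Scene-to-CLIP direction: softmax over each row.
    loss_s2c = -_log_softmax(logits, axis=1)[idx, idx].mean()
    # CLIP-to-scene direction: softmax over each column.
    loss_c2s = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_s2c + loss_c2s)

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
loss_matched = symmetric_info_nce(emb, emb)                       # aligned pairs
loss_shuffled = symmetric_info_nce(emb, np.roll(emb, 1, axis=0))  # misaligned pairs
```

Correctly paired batches yield a lower loss than shuffled ones, which is the signal that pulls each scene embedding toward its own CLIP target and away from the others in the batch.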

Problem

Research questions and friction points this paper is trying to address.

3D scene understanding

pretraining

unified representation

colored pointmap

contrastive learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

contrastive pretraining

colored pointmap

cross-view alignment