AI Summary
Existing 3D world learning methods rely heavily on point clouds, which struggle to capture fine-grained texture details, thereby limiting geometric fidelity and cross-modal understanding. To address this, we introduce 3D Gaussian Splatting (3DGS) into vision-language joint modeling for the first time, establishing a unified embedding space spanning 3D geometry, images, and text. We propose a GS Tokenizer for sequence-based 3DGS encoding, an image-voting loss to guide gradient optimization, and the first 3DGS-image-text triplet generation paradigm. Our method integrates CLIP-based contrastive learning, a Transformer architecture with weights transferred from point cloud models, and a joint loss objective. Evaluated on multimodal retrieval and zero-/few-shot 3D classification, it significantly outperforms point cloud baselines, yielding a unified 3D representation with enhanced texture awareness, high geometric fidelity, and superior cross-modal alignment.
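To make the tokenization step concrete: each 3D Gaussian carries a small, fixed set of attributes (position, scale, rotation, opacity, color), so a scene can be serialized into a token sequence much like a point cloud. The sketch below is a hypothetical stand-in for the GS Tokenizer using a plain linear projection; the attribute layout and dimensions are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed per-Gaussian attributes: position (3), scale (3),
# rotation quaternion (4), opacity (1), RGB color (3) -> 14 dims.
N, ATTR_DIM, EMBED_DIM = 1024, 14, 64
gaussians = rng.normal(size=(N, ATTR_DIM))

# A linear projection as a minimal stand-in for the GS Tokenizer:
# each Gaussian becomes one token, yielding a sequence that
# transformer layers can consume like a point-cloud token sequence.
W = rng.normal(size=(ATTR_DIM, EMBED_DIM)) / np.sqrt(ATTR_DIM)
tokens = gaussians @ W  # (N, EMBED_DIM) serialized Gaussian tokens
print(tokens.shape)     # (1024, 64)
```

Because the resulting tokens have the same shape as point-cloud tokens, transformer weights pre-trained on point clouds can be reused to initialize the encoder, as the summary describes.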
Abstract
Recent works in 3D multimodal learning have made remarkable progress. However, existing 3D multimodal models are typically limited to point clouds. Compared to the emerging 3D representation of 3D Gaussian Splatting (3DGS), spatially sparse point clouds cannot depict the texture information of 3D objects, resulting in inferior reconstruction capability. This limitation constrains the potential of point cloud-based 3D multimodal representation learning. In this paper, we present CLIP-GS, a novel multimodal representation learning framework grounded in 3DGS. We introduce the GS Tokenizer to generate serialized Gaussian tokens, which are then processed through transformer layers pre-initialized with weights from point cloud models, yielding the 3DGS embeddings. CLIP-GS applies a contrastive loss between the 3DGS embeddings and the visual-text embeddings of CLIP, and we introduce an image voting loss to guide the directionality and convergence of gradient optimization. Furthermore, we develop an efficient way to generate triplets of 3DGS, images, and text, enabling CLIP-GS to learn unified multimodal representations. Leveraging these well-aligned multimodal representations, CLIP-GS demonstrates versatility and outperforms point cloud-based models on various 3D tasks, including multimodal retrieval, zero-shot classification, and few-shot classification.
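As a rough illustration of the CLIP-style alignment objective described above, the following is a minimal NumPy sketch of a symmetric contrastive (InfoNCE) loss between batched 3DGS embeddings and CLIP image or text embeddings. The image voting loss is specific to the paper and not reproduced here; the function name and the temperature value are assumptions for illustration.

```python
import numpy as np

def contrastive_loss(gs_emb, clip_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss between two embedding sets.

    gs_emb, clip_emb: (B, D) arrays; row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = gs_emb / np.linalg.norm(gs_emb, axis=1, keepdims=True)
    b = clip_emb / np.linalg.norm(clip_emb, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (B, B) similarity matrix

    def cross_entropy(l):
        # Row-wise log-softmax; the matched pair sits on the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the 3DGS->CLIP and CLIP->3DGS directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In a CLIP-GS-style setup this term would be applied between the 3DGS embeddings and the frozen CLIP image and text embeddings of each triplet; a temperature of 0.07 follows CLIP's initialization and is not a value stated in the abstract.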