UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting

📅 2025-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing point-cloud-based multimodal 3D pretraining methods suffer from a misalignment between discrete 3D representations and continuous 2D image pixels, which hinders fine-grained cross-modal alignment. To address this, we introduce differentiable, continuous, pixel-aligned 3D Gaussian Splatting (3DGS) into language-image-3D joint modeling, the first such effort. We propose a unified multimodal pretraining framework driven by 3DGS and design a Gaussian-aware guidance module to enhance fine-grained 3D feature extraction and cross-modal alignment. Our method integrates 3DGS representation learning, vision-language model (VLM) transfer, 3D encoder alignment, contrastive learning, and cross-modal distillation. Evaluated on four benchmarks (Objaverse, ABO, MVImgNet and SUN RGBD) spanning zero-shot classification, text-driven retrieval and open-world understanding, our approach gains +9.36% in zero-shot classification, +4.3% in text-driven retrieval, and +7.92% in open-world understanding, consistently outperforming state-of-the-art methods such as Uni3D.

📝 Abstract
Recent advancements in multi-modal 3D pre-training methods have shown promising efficacy in learning joint representations of text, images, and point clouds. However, adopting point clouds as the 3D representation fails to fully capture the intricacies of the 3D world and exhibits a noticeable gap between the discrete points and the dense 2D pixels of images. To tackle this issue, we propose UniGS, which integrates 3D Gaussian Splatting (3DGS) into multi-modal pre-training to enhance the 3D representation. We first rely on the 3DGS representation to model the 3D world as a collection of 3D Gaussians with color and opacity, incorporating all the information of the 3D scene while establishing a strong connection with 2D images. Then, to achieve Language-Image-3D pre-training, UniGS starts with a pre-trained vision-language model to establish a shared visual and textual space through extensive real-world image-text pairs. Subsequently, UniGS employs a 3D encoder to align the optimized 3DGS with the Language-Image representations and learn unified multi-modal representations. To facilitate the extraction of global explicit 3D features by the 3D encoder and achieve better cross-modal alignment, we additionally introduce a novel Gaussian-Aware Guidance module that guides the learning of fine-grained representations in the 3D domain. Through extensive experiments across the Objaverse, ABO, MVImgNet and SUN RGBD datasets on zero-shot classification, text-driven retrieval and open-world understanding tasks, we demonstrate the effectiveness of UniGS in learning a more general and strongly aligned multi-modal representation. Specifically, UniGS achieves leading results across different 3D tasks with remarkable improvements over the previous SOTA, Uni3D, including on zero-shot classification (+9.36%), text-driven retrieval (+4.3%) and open-world understanding (+7.92%).
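As context for the alignment step described in the abstract, the core idea of aligning a 3D encoder's output with a frozen language-image space is typically a CLIP-style symmetric contrastive (InfoNCE) objective. The sketch below is a minimal illustration of that general recipe, not the paper's exact loss; the embedding dimension, temperature value, and the plain numpy encoder-free setup are all assumptions for demonstration.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    a, b: (batch, dim) arrays; row i of `a` is a 3D embedding and
    row i of `b` its matching text (or image) embedding.
    """
    # L2-normalize so the dot product equals cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(a))       # matching pairs sit on the diagonal

    def ce(l):
        # Numerically stable cross-entropy with diagonal targets.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the 3D-to-text and text-to-3D directions.
    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(0)
z3d = rng.normal(size=(4, 8))                # stand-in 3D encoder outputs
ztxt = z3d + 0.01 * rng.normal(size=(4, 8))  # nearly aligned text embeddings
loss = info_nce(z3d, ztxt)
```

Well-aligned pairs drive the loss toward zero, while embeddings of mismatched objects keep it near the log of the batch size, which is what pushes the 3D encoder toward the frozen VLM space during training.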
Problem

Research questions and friction points this paper is trying to address.

Point clouds, the dominant 3D representation in multi-modal pre-training, fail to fully capture the intricacies of the 3D world.
Discrete points exhibit a noticeable gap with the dense, continuous 2D pixels of images.
This gap hinders fine-grained cross-modal alignment between text, images, and 3D.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates 3D Gaussian Splatting into pre-training for an enhanced 3D representation.
Builds on a pre-trained vision-language model to establish a shared visual-textual space.
Introduces a Gaussian-Aware Guidance module for fine-grained 3D feature learning.
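To make the 3DGS input concrete: each Gaussian is a small parameter vector, and an object becomes a set of such vectors fed to a point-based 3D encoder. The channel layout below (mean, scale, rotation quaternion, opacity, RGB color, 14 channels total) follows standard 3DGS attributes and is an assumption for illustration, not the paper's specified input format.

```python
import numpy as np

def pack_gaussians(means, scales, quats, opacities, colors):
    """Pack per-Gaussian 3DGS attributes into an (N, 14) feature array.

    means:     (N, 3) Gaussian centers
    scales:    (N, 3) per-axis extents
    quats:     (N, 4) rotation quaternions
    opacities: (N, 1) alpha values in [0, 1]
    colors:    (N, 3) RGB (view-independent approximation)
    """
    return np.concatenate([means, scales, quats, opacities, colors], axis=1)

n = 1024  # number of Gaussians for one object
rng = np.random.default_rng(1)
feats = pack_gaussians(
    rng.normal(size=(n, 3)),
    rng.uniform(0.01, 0.1, size=(n, 3)),
    rng.normal(size=(n, 4)),
    rng.uniform(size=(n, 1)),
    rng.uniform(size=(n, 3)),
)
# feats has shape (1024, 14) and can be consumed like an attributed point cloud.
```

Because the first three channels are ordinary point coordinates, existing point-cloud encoder architectures can ingest this array directly, with the extra eleven channels carrying the color and opacity information the abstract highlights.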
👥 Authors
Haoyuan Li, Shenzhen campus of Sun Yat-sen University
Yanpeng Zhou, Huawei Noah’s Ark Lab
Tao Tang, Shenzhen campus of Sun Yat-sen University
Jifei Song, Huawei Noah’s Ark Lab (Neural Rendering, Computer Vision, Deep Learning, Image Processing, Speech Processing)
Yihan Zeng, Huawei Noah’s Ark Lab
Michael C. Kampffmeyer, UiT The Arctic University of Norway
Hang Xu, Huawei Noah’s Ark Lab
Xiaodan Liang, Professor of Computer Science, Sun Yat-sen University, MBZUAI, CMU, NUS (Computer Vision, Embodied AI, Machine Learning)