GaussianCross: Cross-modal Self-supervised 3D Representation Learning via Gaussian Splatting

๐Ÿ“… 2025-08-04
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address model collapse and weak structural representation of point clouds in 3D self-supervised pretraining, this paper proposes a cross-modal self-supervised framework based on Gaussian splatting. Our method introduces: (1) a unified Gaussian representation with cubic normalization, explicitly encoding geometric, appearance, and semantic attributes; and (2) a tri-attribute adaptive distillation module that enforces fine-grained feature alignment and consistency via cross-modal contrastive learning. The framework requires only feed-forward inference and linear probing for evaluationโ€”no auxiliary decoders or complex architectures are needed. Evaluated on ScanNet, ScanNet200, and S3DIS, our approach achieves state-of-the-art performance using merely 0.1% of the parameters and 1% of the training data: semantic segmentation mIoU improves by 9.3%, and instance segmentation AP$_{50}$ rises by 6.1%, demonstrating significantly enhanced generalization and training stability.

Technology Category

Application Category

๐Ÿ“ Abstract
The significance of informative and robust point representations has been widely acknowledged for 3D scene understanding. Despite existing self-supervised pre-training counterparts demonstrating promising performance, the model collapse and structural information deficiency remain prevalent due to insufficient point discrimination difficulty, yielding unreliable expressions and suboptimal performance. In this paper, we present GaussianCross, a novel cross-modal self-supervised 3D representation learning architecture integrating feed-forward 3D Gaussian Splatting (3DGS) techniques to address current challenges. GaussianCross seamlessly converts scale-inconsistent 3D point clouds into a unified cuboid-normalized Gaussian representation without missing details, enabling stable and generalizable pre-training. Subsequently, a tri-attribute adaptive distillation splatting module is incorporated to construct a 3D feature field, facilitating synergetic feature capturing of appearance, geometry, and semantic cues to maintain cross-modal consistency. To validate GaussianCross, we perform extensive evaluations on various benchmarks, including ScanNet, ScanNet200, and S3DIS. In particular, GaussianCross shows a prominent parameter and data efficiency, achieving superior performance through linear probing (<0.1% parameters) and limited data training (1% of scenes) compared to state-of-the-art methods. Furthermore, GaussianCross demonstrates strong generalization capabilities, improving the full fine-tuning accuracy by 9.3% mIoU and 6.1% AP$_{50}$ on ScanNet200 semantic and instance segmentation tasks, respectively, supporting the effectiveness of our approach. The code, weights, and visualizations are publicly available at href{https://rayyoh.github.io/GaussianCross/}{https://rayyoh.github.io/GaussianCross/}.
Problem

Research questions and friction points this paper is trying to address.

Addresses model collapse in 3D self-supervised learning
Improves structural information in point cloud representations
Enhances cross-modal consistency in 3D feature learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses 3D Gaussian Splatting for representation learning
Converts point clouds to unified Gaussian representation
Tri-attribute distillation for cross-modal feature capturing
๐Ÿ”Ž Similar Papers
No similar papers found.
L
Lei Yao
Hong Kong Polytechnic University
Y
Yi Wang
Hong Kong Polytechnic University
Y
Yi Zhang
Hong Kong Polytechnic University
Moyun Liu
Moyun Liu
Huazhong University of Science and Technology
Embodied AIComputer Vision
Lap-Pui Chau
Lap-Pui Chau
The Hong Kong Polytechnic University
Visual Signal Processing