🤖 AI Summary
Existing methods naively append high-dimensional CLIP embeddings to 3D Gaussians and rely on view-inconsistent 2D CLIP supervision, leading to semantic ambiguity and computational inefficiency. This paper proposes CLIP-GS, a framework that embeds CLIP semantic priors directly into the 3D Gaussian parameterization, enabling label-free, real-time, view-consistent open-vocabulary 3D semantic understanding. Key contributions include: (1) a Semantic Attribute Compactness (SAC) module that drastically compresses the dimensionality of Gaussian semantic representations; and (2) a 3D Coherent Self-training (3DCS) strategy that eliminates view conflicts induced by 2D supervision via multi-view consistency constraints and self-supervised pseudo-labeling. Evaluated on Replica and ScanNet, CLIP-GS achieves absolute mIoU gains of 17.29% and 20.81%, respectively, renders at over 100 FPS, and remains robust under sparse input views.
📝 Abstract
The recent 3D Gaussian Splatting (GS) achieves high-quality, real-time novel-view synthesis of 3D scenes. However, it primarily focuses on geometry and appearance modeling, lacking semantic understanding of scenes. To bridge this gap, we present CLIP-GS, which integrates semantics from Contrastive Language-Image Pre-Training (CLIP) into Gaussian Splatting to efficiently comprehend 3D environments without annotated semantic data. Specifically, rather than straightforwardly learning and rendering high-dimensional semantic features of 3D Gaussians, which significantly diminishes efficiency, we propose a Semantic Attribute Compactness (SAC) approach. SAC exploits the inherently unified semantics within objects to learn compact yet effective semantic representations of 3D Gaussians, enabling highly efficient rendering (>100 FPS). Additionally, to address the semantic ambiguity caused by supervising Gaussians with view-inconsistent 2D CLIP semantics, we introduce a 3D Coherent Self-training (3DCS) strategy that draws on the multi-view consistency inherent in the 3D model. 3DCS imposes cross-view semantic consistency constraints by leveraging refined, self-predicted pseudo-labels derived from the trained 3D Gaussian model, thereby producing precise and view-consistent segmentation results. Extensive experiments demonstrate that our method remarkably outperforms existing state-of-the-art approaches, achieving improvements of 17.29% and 20.81% in mIoU on the Replica and ScanNet datasets, respectively, while maintaining real-time rendering speed. Furthermore, our approach retains superior performance even with sparse input data, verifying its robustness.
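The two mechanisms in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the linear decoder, the feature dimensions, and the majority-vote refinement are illustrative assumptions standing in for SAC's compact-to-CLIP decoding, CLIP-text open-vocabulary matching, and 3DCS-style cross-view pseudo-label refinement.

```python
import numpy as np

# Hedged sketch (NOT the paper's code):
# (1) SAC-style compactness: each Gaussian stores a low-dimensional semantic
#     attribute, decoded into CLIP space only when needed, instead of a full
#     high-dimensional CLIP embedding per Gaussian.
# (2) Open-vocabulary query: decoded features are matched against CLIP text
#     embeddings by cosine similarity.
# (3) 3DCS-style refinement: per-view labels are fused across views so that
#     view-inconsistent 2D supervision is suppressed.

def decode_compact_semantics(compact, codebook):
    """Map compact per-Gaussian attributes (N, d) to CLIP space (N, D)
    through a learned linear decoder (d, D); here it is just a matrix."""
    return compact @ codebook

def open_vocab_segment(clip_feats, text_embeds):
    """Assign each Gaussian the text class with highest cosine similarity."""
    f = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return np.argmax(f @ t.T, axis=1)

def refine_pseudo_labels(per_view_labels):
    """Majority vote over the labels a Gaussian receives across V views.
    per_view_labels: (V, N) integer array of per-view predictions."""
    V, N = per_view_labels.shape
    refined = np.empty(N, dtype=int)
    for i in range(N):
        refined[i] = np.bincount(per_view_labels[:, i]).argmax()
    return refined
```

With a toy identity codebook and one-hot compact attributes, the pipeline assigns each Gaussian its matching text class, and the vote keeps the label agreed on by the majority of views; the real method instead learns the compact attributes end-to-end and renders them through the splatting pipeline.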