AI Summary
Existing 3D semantic segmentation methods heavily rely on 2D priors, exhibiting poor generalization across scenes, domains, and novel viewpoints. To address this, we propose an open-vocabulary-driven universal 3D Gaussian semantic segmentation framework that enables zero-shot 3D semantic understanding beyond single-scene knowledge transfer. Our method introduces three key innovations: (1) Generalizable Semantic Rasterization (GSR), enabling fine-grained, point-level semantic rendering; (2) Cross-modal Consistency Learning (CCL), jointly enforcing multi-view geometric constraints and open-vocabulary vision-language alignment; and (3) SegGaussian, the first publicly available 3D Gaussian dataset with dense semantic and instance annotations. Extensive experiments demonstrate substantial improvements over state-of-the-art baselines on cross-scene, cross-domain, and novel-view segmentation tasks, while supporting zero-shot inference from arbitrary text queries. Both the code and SegGaussian are open-sourced.
Abstract
Open-vocabulary scene understanding based on 3D Gaussian Splatting (3DGS) representations has garnered considerable attention. However, existing methods mostly lift knowledge from large 2D vision models into 3DGS on a scene-by-scene basis, restricting open-vocabulary querying to each method's training scenes and preventing generalization to novel scenes. In this work, we propose OVGaussian, a generalizable Open-Vocabulary 3D semantic segmentation framework based on the 3D Gaussian representation. We first construct a large-scale 3D scene dataset based on 3DGS, dubbed SegGaussian, which provides detailed semantic and instance annotations for both Gaussian points and multi-view images. To promote semantic generalization across scenes, we introduce Generalizable Semantic Rasterization (GSR), which employs a 3D neural network to learn and predict a semantic property for each 3D Gaussian point; these semantic properties can then be rendered as multi-view-consistent 2D semantic maps. Next, we propose a Cross-modal Consistency Learning (CCL) framework that utilizes the open-vocabulary annotations of 2D images and 3D Gaussians in SegGaussian to train a 3D neural network capable of open-vocabulary semantic segmentation across Gaussian-based 3D scenes. Experimental results demonstrate that OVGaussian significantly outperforms baseline methods, exhibiting robust cross-scene, cross-domain, and novel-view generalization capabilities. Code and the SegGaussian dataset will be released at https://github.com/runnanchen/OVGaussian.
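To make the rendering step concrete: GSR renders per-Gaussian semantic properties into 2D semantic maps, and in standard 3DGS this kind of rendering is done by front-to-back alpha compositing of depth-sorted Gaussians along each pixel ray. The sketch below illustrates that compositing rule for a single pixel; it is a minimal illustration of the general 3DGS blending formula, not the paper's actual rasterizer, and the function name and array shapes are assumptions.

```python
import numpy as np

def composite_semantics(sem_feats, alphas):
    """Alpha-composite per-Gaussian semantic features along one pixel ray,
    front to back, mirroring how 3DGS blends colors:
        S = sum_i s_i * alpha_i * prod_{j<i} (1 - alpha_j)
    sem_feats: (N, D) semantic feature of each depth-sorted Gaussian
    alphas:    (N,)  opacity contribution of each Gaussian at this pixel
    Returns the (D,) blended semantic feature for the pixel."""
    transmittance = 1.0          # fraction of light not yet absorbed
    out = np.zeros(sem_feats.shape[1])
    for s, a in zip(sem_feats, alphas):
        out += transmittance * a * s
        transmittance *= (1.0 - a)
    return out

# Two Gaussians with one-hot semantics: a 0.6-opaque one in front,
# a fully opaque one behind it.
pixel = composite_semantics(np.array([[1.0, 0.0], [0.0, 1.0]]),
                            np.array([0.6, 1.0]))
# pixel -> [0.6, 0.4]: the front Gaussian contributes 0.6,
# the back one the remaining transmittance.
```

Because the same per-Gaussian semantic property is composited under every camera pose, the rendered 2D semantic maps are multi-view consistent by construction.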
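The open-vocabulary query step described above can likewise be sketched simply: if the learned per-Gaussian semantic features live in a shared vision-language embedding space (e.g. a CLIP-like space), a text query reduces to a cosine-similarity nearest-neighbor lookup. This is a hedged sketch under that assumption; the function and variable names are illustrative, not the paper's API.

```python
import numpy as np

def open_vocab_labels(gauss_feats, text_embeds):
    """Assign each Gaussian the label of its most similar text embedding
    by cosine similarity (zero-shot query, assuming a shared CLIP-like
    embedding space).
    gauss_feats: (N, D) per-Gaussian semantic features
    text_embeds: (C, D) embeddings of C text prompts
    Returns (N,) integer label indices."""
    g = gauss_feats / np.linalg.norm(gauss_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return (g @ t.T).argmax(axis=1)

# Toy example: two Gaussians, two text prompts ("chair", "table", say).
labels = open_vocab_labels(np.array([[1.0, 0.0], [0.0, 2.0]]),
                           np.array([[0.0, 1.0], [1.0, 0.0]]))
# labels -> [1, 0]: each Gaussian matches the prompt aligned with it.
```

Since the class set is just a list of text embeddings, new categories can be queried at inference time without retraining, which is what enables the cross-scene and zero-shot evaluation described above.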