AI Summary
Existing 3D semantic segmentation methods heavily rely on 2D priors, exhibiting poor generalization across scenes, domains, and novel viewpoints. To address this, we propose an open-vocabulary-driven universal 3D Gaussian semantic segmentation framework that enables zero-shot 3D semantic understanding beyond single-scene knowledge transfer. Our method introduces three key innovations: (1) Generalizable Semantic Rasterization (GSR), enabling fine-grained, point-level semantic rendering; (2) Cross-modal Consistency Learning (CCL), jointly enforcing multi-view geometric constraints and open-vocabulary vision-language alignment; and (3) SegGaussian, the first publicly available 3D Gaussian dataset with dense semantic and instance annotations. Extensive experiments demonstrate substantial improvements over state-of-the-art baselines on cross-scene, cross-domain, and novel-view segmentation tasks, while supporting zero-shot inference from arbitrary text queries. Both the code and SegGaussian are open-sourced.
Abstract
Open-vocabulary scene understanding based on 3D Gaussian Splatting (3DGS) representations has garnered considerable attention. However, existing methods mostly lift knowledge from large 2D vision models into 3DGS on a scene-by-scene basis, restricting open-vocabulary querying to each method's training scenes and preventing generalization to novel scenes. In this work, we propose OVGaussian, a generalizable Open-Vocabulary 3D semantic segmentation framework based on the 3D Gaussian representation. We first construct a large-scale 3D scene dataset based on 3DGS, dubbed SegGaussian, which provides detailed semantic and instance annotations for both Gaussian points and multi-view images. To promote semantic generalization across scenes, we introduce Generalizable Semantic Rasterization (GSR), which employs a 3D neural network to learn and predict a semantic property for each 3D Gaussian point; these semantic properties can then be rendered as multi-view-consistent 2D semantic maps. Next, we propose a Cross-modal Consistency Learning (CCL) framework that utilizes the open-vocabulary annotations of 2D images and 3D Gaussians in SegGaussian to train a 3D neural network capable of open-vocabulary semantic segmentation across Gaussian-based 3D scenes. Experimental results demonstrate that OVGaussian significantly outperforms baseline methods, exhibiting robust cross-scene, cross-domain, and novel-view generalization capabilities. Code and the SegGaussian dataset will be released at https://github.com/runnanchen/OVGaussian.
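To make the rendering step concrete: GSR renders per-Gaussian semantic properties into 2D semantic maps, and in standard 3DGS this kind of rendering is done by front-to-back alpha compositing of depth-sorted Gaussians along each pixel ray. The sketch below illustrates that compositing rule for a single pixel; it is a minimal illustration of the general 3DGS blending formula, not the paper's actual rasterizer, and the function name and array shapes are assumptions.

```python
import numpy as np

def composite_semantics(sem_feats, alphas):
    """Alpha-composite per-Gaussian semantic features along one pixel ray,
    front to back, mirroring how 3DGS blends colors:
        S = sum_i s_i * alpha_i * prod_{j<i} (1 - alpha_j)
    sem_feats: (N, D) semantic feature of each depth-sorted Gaussian
    alphas:    (N,)  opacity contribution of each Gaussian at this pixel
    Returns the (D,) blended semantic feature for the pixel."""
    transmittance = 1.0          # fraction of light not yet absorbed
    out = np.zeros(sem_feats.shape[1])
    for s, a in zip(sem_feats, alphas):
        out += transmittance * a * s
        transmittance *= (1.0 - a)
    return out

# Two Gaussians with one-hot semantics: a 0.6-opaque one in front,
# a fully opaque one behind it.
pixel = composite_semantics(np.array([[1.0, 0.0], [0.0, 1.0]]),
                            np.array([0.6, 1.0]))
# pixel -> [0.6, 0.4]: the front Gaussian contributes 0.6,
# the back one the remaining transmittance.
```

Because the same per-Gaussian semantic property is composited under every camera pose, the rendered 2D semantic maps are multi-view consistent by construction.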
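The open-vocabulary query step described above can likewise be sketched simply: if the learned per-Gaussian semantic features live in a shared vision-language embedding space (e.g. a CLIP-like space), a text query reduces to a cosine-similarity nearest-neighbor lookup. This is a hedged sketch under that assumption; the function and variable names are illustrative, not the paper's API.

```python
import numpy as np

def open_vocab_labels(gauss_feats, text_embeds):
    """Assign each Gaussian the label of its most similar text embedding
    by cosine similarity (zero-shot query, assuming a shared CLIP-like
    embedding space).
    gauss_feats: (N, D) per-Gaussian semantic features
    text_embeds: (C, D) embeddings of C text prompts
    Returns (N,) integer label indices."""
    g = gauss_feats / np.linalg.norm(gauss_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return (g @ t.T).argmax(axis=1)

# Toy example: two Gaussians, two text prompts ("chair", "table", say).
labels = open_vocab_labels(np.array([[1.0, 0.0], [0.0, 2.0]]),
                           np.array([[0.0, 1.0], [1.0, 0.0]]))
# labels -> [1, 0]: each Gaussian matches the prompt aligned with it.
```

Since the class set is just a list of text embeddings, new categories can be queried at inference time without retraining, which is what enables the cross-scene and zero-shot evaluation described above.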