🤖 AI Summary
Conventional fixed-vocabulary models fail to recognize unseen object categories in open-vocabulary 3D scene segmentation. To address this limitation, this paper introduces the first end-to-end framework for transferring multi-source 2D visual priors to 3D point clouds. Methodologically: (1) we propose a novel text-driven embedding alignment mechanism that maps multiple pre-trained 2D vision models into a shared space, ensuring cross-model semantic consistency; (2) we design a diffusion-based, annotation-free method to quantify and rank the open-vocabulary capabilities of 2D models; and (3) we integrate geometric-semantic point cloud features and perform cross-dimensional feature distillation to effectively transfer open-vocabulary recognition capacity from 2D to 3D. Our approach achieves significant improvements over state-of-the-art methods on ScanNet v2, Matterport3D, and nuScenes, and it further demonstrates strong generalization on downstream tasks including Gaussian splatting-based segmentation and zero-shot instance segmentation.
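The text-driven alignment idea in step (1) can be sketched in toy form: because the same category names can be embedded by each 2D model's paired text encoder, the shared names provide corresponding anchor points in both embedding spaces, from which a map between the spaces can be fit. The linear-map assumption and all names below are illustrative, not the paper's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
D, C, N = 8, 16, 5  # feature dim, shared category names, sample features

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Each 2D model has its own text encoder, so the same category names
# land in different embedding spaces. Here model B's text space is a
# toy linear transform of model A's (unknown to the alignment step).
text_a = l2norm(rng.standard_normal((C, D)))  # model A's text embeddings
M_true = rng.standard_normal((D, D))          # hidden relation (toy)
text_b = text_a @ M_true                      # model B's text embeddings

# Use the shared category names as a bridge: fit a linear map carrying
# model B's text embeddings onto model A's (least squares).
M, *_ = np.linalg.lstsq(text_b, text_a, rcond=None)

# Any visual feature from model B can now be projected into model A's
# space, giving all models a common embedding space.
feat_b = l2norm(rng.standard_normal((N, D)))
feat_b_aligned = feat_b @ M
```

With `C > D` corresponding text anchors, the least-squares fit recovers the (toy) relation between the spaces exactly; the paper's actual mechanism may use a different parameterization.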
📝 Abstract
The open-vocabulary capability of 3D models is increasingly valued, as traditional models trained on fixed category sets fail to recognize unseen objects in complex, dynamic 3D scenes. In this paper, we propose a simple yet effective approach, SAS, to integrate the open-vocabulary capability of multiple 2D models and transfer it to the 3D domain. Specifically, we first propose Model Alignment via Text, which maps different 2D models into the same embedding space using text as a bridge. Then, we propose Annotation-Free Model Capability Construction, which uses diffusion models to explicitly quantify each 2D model's capability of recognizing different categories. Following this, point cloud features from different 2D models are fused under the guidance of the constructed model capabilities. Finally, the integrated 2D open-vocabulary capability is transferred to the 3D domain through feature distillation. SAS outperforms previous methods by a large margin across multiple datasets, including ScanNet v2, Matterport3D, and nuScenes, while its generalizability is further validated on downstream tasks, e.g., Gaussian segmentation and instance segmentation.
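As a rough illustration of the last steps of the pipeline (per-category capability scores, capability-guided fusion of per-point features, and feature distillation), here is a minimal NumPy sketch. The random capability scores stand in for the diffusion-based estimates, and every name and shape below is hypothetical, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 6, 8, 4  # points, feature dim, candidate categories

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Per-point features lifted from two 2D models, assumed already mapped
# into one shared, text-aligned embedding space (random stand-ins here).
feat_a = l2norm(rng.standard_normal((N, D)))
feat_b = l2norm(rng.standard_normal((N, D)))

# Text embeddings of the C candidate categories in the shared space.
text_emb = l2norm(rng.standard_normal((C, D)))

# Capability scores: how well each model recognizes each category.
# The paper estimates these annotation-free with diffusion models;
# random positives stand in for them here.
cap_a = rng.uniform(size=C)
cap_b = rng.uniform(size=C)

# Soft category assignment per point, CLIP-style (temperature 0.07).
prob_a = softmax(feat_a @ text_emb.T / 0.07)
prob_b = softmax(feat_b @ text_emb.T / 0.07)

# Per-point fusion weight: expected capability of each model under its
# own predicted category distribution, normalized to sum to one.
w_a = prob_a @ cap_a
w_b = prob_b @ cap_b
wa = w_a / (w_a + w_b)
feat_fused = l2norm(wa[:, None] * feat_a + (1 - wa)[:, None] * feat_b)

# Distillation target: train a 3D backbone so its point features match
# the fused 2D features. A small perturbation stands in for the 3D
# network's output; the loss is a cosine-distance distillation loss.
feat_3d = l2norm(feat_fused + 0.01 * rng.standard_normal((N, D)))
loss = 1.0 - (feat_3d * feat_fused).sum(-1).mean()
```

In this sketch a model with higher (estimated) capability for a point's likely category contributes more to that point's fused feature, which is then used as the supervision signal for the 3D network.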