🤖 AI Summary
Existing 3D instance segmentation methods rely on predefined category vocabularies, limiting their ability to handle open-ended queries such as "list all objects in the scene." This work introduces the first vocabulary-free 3D instance segmentation paradigm: it requires no category priors and instead leverages a vision-language assistant to guide an open-vocabulary 2D segmenter across multi-view images, autonomously discovering and grounding semantic categories. Instance masks from multiple views are then lifted into 3D via superpoint partitioning and spectral clustering, using a novel superpoint merging strategy that jointly accounts for mask consistency and semantic coherence. Evaluated on ScanNet200 and Replica, the method achieves state-of-the-art performance in both vocabulary-free and open-vocabulary settings, improving the generality and openness of 3D scene understanding.
📝 Abstract
Most recent 3D instance segmentation methods are open-vocabulary, offering greater flexibility than closed-vocabulary methods. Yet they are limited to reasoning within a specific set of concepts, i.e., the vocabulary, prompted by the user at test time. In essence, these models cannot reason in an open-ended fashion, e.g., answer "List the objects in the scene." We introduce the first method to address 3D instance segmentation in a setting devoid of any vocabulary prior, namely a vocabulary-free setting. We leverage a large vision-language assistant and an open-vocabulary 2D instance segmenter to discover and ground semantic categories on the posed images. To form 3D instance masks, we first partition the input point cloud into dense superpoints, which are then merged into 3D instance masks. We propose a novel superpoint merging strategy via spectral clustering that accounts for both mask coherence and semantic coherence, estimated from the 2D object instance masks. We evaluate our method on ScanNet200 and Replica, outperforming existing methods in both vocabulary-free and open-vocabulary settings. Code will be made available. Project page: https://gfmei.github.io/PoVo
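The merging step described above can be illustrated with a minimal sketch: spectral clustering over superpoints, where the graph affinity blends a pairwise mask-coherence matrix with a pairwise semantic-coherence matrix. This is not the paper's implementation; the function names, the linear blend weight `alpha`, the fixed cluster count `n_clusters`, and the simple k-means step are all illustrative assumptions, and the two affinity matrices are taken as precomputed inputs (in the paper they would be estimated from the 2D instance masks).

```python
import numpy as np


def _kmeans(X, k, iters=50):
    """Plain Lloyd k-means with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(k - 1):
        # next center: the point farthest from all chosen centers
        d = np.min(((X[:, None] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = ((X[:, None] - centers[None]) ** 2).sum(-1)   # (N, k) squared distances
        labels = d.argmin(1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return labels


def merge_superpoints(mask_affinity, sem_affinity, n_clusters, alpha=0.5):
    """Merge N superpoints into instance clusters by spectral clustering.

    mask_affinity, sem_affinity: symmetric (N, N) matrices in [0, 1]
    (assumed precomputed from the 2D instance masks). Returns an (N,)
    array of instance labels.
    """
    # Combined affinity: blend of mask coherence and semantic coherence.
    W = alpha * mask_affinity + (1.0 - alpha) * sem_affinity
    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Spectral embedding: eigenvectors of the k smallest eigenvalues.
    _, vecs = np.linalg.eigh(L)          # eigh returns ascending eigenvalues
    emb = vecs[:, :n_clusters]
    emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)
    return _kmeans(emb, n_clusters)
```

With a block-structured affinity (two groups of superpoints that co-occur in the same 2D masks and share labels), the two groups come out as two instance clusters. Note the cluster count is fixed here only for brevity; a practical system would estimate it, e.g., from the Laplacian's eigengap.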