🤖 AI Summary
This work addresses open-vocabulary 3D scene understanding through a new automated data generation pipeline and an accompanying training framework. To overcome the scarcity of high-quality, large-scale 3D mask–text paired data, a key limitation of existing methods, the authors introduce Mosaic3D-5.6M, a dataset of over 30K annotated scenes with 5.6 million mask–text pairs, significantly larger than existing open-vocabulary 3D segmentation datasets. It is constructed via an automated pipeline that integrates open-vocabulary 2D segmentation models and region-aware vision-language models to generate precise 3D masks paired with semantically rich textual descriptions. The resulting foundation model, Mosaic3D, combines a 3D encoder trained with contrastive learning and a lightweight, promptable mask decoder for open-vocabulary 3D semantic and instance segmentation. It achieves state-of-the-art performance on ScanNet200, Matterport3D, and ScanNet++, with ablation studies confirming that large-scale, high-fidelity mask–text pairs are the primary driver of the performance gains.
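To make the data pipeline concrete, the sketch below illustrates only the geometric step of lifting a 2D segmentation mask into a 3D point mask: masked pixels are back-projected through the depth map and camera parameters and then associated with the reconstructed scene point cloud. This is a minimal sketch under assumed inputs, not the authors' released code; the actual pipeline additionally runs open-vocabulary 2D segmenters and region-aware vision-language models to produce the masks and captions, and all function and variable names here (e.g. `lift_mask_to_3d`) are illustrative.

```python
# Minimal, assumed sketch of lifting a 2D mask to a 3D point mask.
import numpy as np

def lift_mask_to_3d(mask_2d, depth, K, cam_to_world, scene_points, dist_thresh=0.05):
    """Return a boolean mask over `scene_points` covered by the 2D mask.

    mask_2d:      (H, W) bool   -- output of an open-vocabulary 2D segmenter
    depth:        (H, W) float  -- metric depth for the same frame
    K:            (3, 3) float  -- camera intrinsics
    cam_to_world: (4, 4) float  -- camera pose
    scene_points: (N, 3) float  -- reconstructed scene point cloud
    """
    v, u = np.nonzero(mask_2d & (depth > 0))          # pixel coords inside the mask
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]                   # pinhole back-projection
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]   # lift to the world frame

    # Brute-force nearest-neighbor association; a KD-tree would be used at scale.
    d = np.linalg.norm(scene_points[:, None, :] - pts_world[None, :, :], axis=-1)
    return d.min(axis=1) < dist_thresh

# Toy usage with synthetic data, just to show the shapes involved.
if __name__ == "__main__":
    H, W, N = 8, 8, 100
    mask = np.zeros((H, W), dtype=bool); mask[2:5, 2:5] = True
    depth = np.full((H, W), 2.0)
    K = np.array([[100.0, 0.0, W / 2], [0.0, 100.0, H / 2], [0.0, 0.0, 1.0]])
    pose = np.eye(4)
    points = np.random.rand(N, 3) * 4.0
    print(lift_mask_to_3d(mask, depth, K, pose, points).sum(), "scene points selected")
```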
📝 Abstract
We tackle open-vocabulary 3D scene understanding by introducing a novel data generation pipeline and training framework. Our method addresses three critical requirements for effective training: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale. By leveraging state-of-the-art open-vocabulary image segmentation models and region-aware Vision-Language Models, we develop an automatic pipeline that generates high-quality 3D mask-text pairs. Applying this pipeline to multiple 3D scene datasets, we create Mosaic3D-5.6M, a dataset of over 30K annotated scenes with 5.6M mask-text pairs, significantly larger than existing datasets. Building on this data, we propose Mosaic3D, a foundation model combining a 3D encoder trained with contrastive learning and a lightweight mask decoder for open-vocabulary 3D semantic and instance segmentation. Our approach achieves state-of-the-art results on open-vocabulary 3D semantic and instance segmentation benchmarks, including ScanNet200, Matterport3D, and ScanNet++, with ablation studies validating the effectiveness of our large-scale training data.
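The abstract describes a 3D encoder trained with contrastive learning against the paired textual descriptions. The snippet below is a minimal sketch of one plausible form of such an objective: an InfoNCE-style loss that pulls pooled per-mask 3D features toward the embeddings of their captions. It is an assumption for illustration, not the paper's exact loss; tensor shapes, the pooling scheme, and the temperature value are all illustrative.

```python
# Assumed sketch of a mask-text contrastive objective (not the paper's exact loss).
import torch
import torch.nn.functional as F

def mask_text_contrastive_loss(point_feats, mask_assign, text_embs, temperature=0.07):
    """InfoNCE-style loss between pooled 3D mask features and caption embeddings.

    point_feats: (P, D)  per-point features from the 3D encoder
    mask_assign: (M, P)  boolean membership of points in each of M masks
    text_embs:   (M, D)  embeddings of the paired textual descriptions
    """
    # Average-pool point features inside each mask to get one feature per mask.
    weights = mask_assign.float()
    mask_feats = weights @ point_feats / weights.sum(dim=1, keepdim=True).clamp(min=1)

    mask_feats = F.normalize(mask_feats, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)

    logits = mask_feats @ text_embs.t() / temperature   # (M, M) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: each mask should match its own caption and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors to illustrate the expected shapes.
if __name__ == "__main__":
    P, M, D = 2048, 16, 512
    feats = torch.randn(P, D)
    assign = torch.rand(M, P) > 0.9
    texts = torch.randn(M, D)
    print(mask_text_contrastive_loss(feats, assign, texts).item())
```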