PointSeg: A Training-Free Paradigm for 3D Scene Segmentation via Foundation Models

📅 2024-03-11
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
This work addresses the challenge of performing 3D scene segmentation without training 3D foundation models. We propose the first training-free paradigm for 3D point cloud segmentation. Our method leverages off-the-shelf vision foundation models (e.g., SAM, CLIP) and constructs 3D point–bounding-box prompt pairs via cross-frame pixel alignment. We introduce a dual-branch prompt learning architecture, a bidirectional matching strategy, and an affinity-aware mask fusion algorithm, supporting plug-and-play integration of multiple vision models. On benchmarks including ScanNet, our approach achieves mAP gains of 12.3–14.1% over existing training-free methods and outperforms supervised state-of-the-art methods by 3.4–5.4%. Our core contribution is the first general-purpose, zero-training 3D segmentation framework—overcoming a critical bottleneck in adapting vision foundation models to 3D tasks.
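The summary's "3D point–bounding-box prompt pairs via cross-frame pixel alignment" can be pictured as projecting each 3D point into every camera frame that sees it and attaching a 2D box prompt around the resulting pixel. The sketch below is an illustrative assumption, not the paper's actual pipeline: the pinhole projection, the `frames` dictionary layout, and the fixed `box_size` are all placeholders.

```python
import numpy as np

def project_point(point_w, extrinsic, intrinsic):
    """Project a 3D world point into one camera frame; return pixel (u, v) and depth."""
    p_cam = extrinsic @ np.append(point_w, 1.0)   # world -> camera (homogeneous 4-vector)
    if p_cam[2] <= 0:                             # point is behind the camera
        return None
    uvw = intrinsic @ p_cam[:3]                   # camera -> image plane
    return uvw[:2] / uvw[2], p_cam[2]

def point_box_prompts(point_w, frames, box_size=32):
    """Pair a 3D point with a 2D box prompt in every frame where it is visible."""
    prompts = []
    for cam in frames:                            # cam: dict with 'extrinsic', 'intrinsic', 'hw'
        out = project_point(point_w, cam["extrinsic"], cam["intrinsic"])
        if out is None:
            continue
        (u, v), _depth = out
        h, w = cam["hw"]
        if 0 <= u < w and 0 <= v < h:             # pixel lands inside this frame
            half = box_size / 2
            prompts.append(((u, v), (u - half, v - half, u + half, v + half)))
    return prompts
```

A point paired with boxes across several frames could then be fed to a promptable 2D segmenter such as SAM, one frame at a time.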

📝 Abstract
The recent success of vision foundation models has shown promising performance on 2D perception tasks. However, it is difficult to train a 3D foundation network directly due to limited datasets, and it remains underexplored whether existing foundation models can be lifted to 3D space seamlessly. In this paper, we present PointSeg, a novel training-free paradigm that leverages off-the-shelf vision foundation models to address 3D scene perception tasks. PointSeg can segment anything in a 3D scene by acquiring accurate 3D prompts that align their corresponding pixels across frames. Concretely, we design a two-branch prompt learning structure to construct 3D point-box prompt pairs, combined with a bidirectional matching strategy for accurate point and proposal prompt generation. We then perform iterative post-refinement adaptively when cooperating with different vision foundation models. Moreover, we design an affinity-aware merging algorithm to improve the final ensemble masks. PointSeg demonstrates impressive segmentation performance across various datasets, all without training. Specifically, our approach significantly surpasses the state-of-the-art specialist training-free model by 14.1%, 12.3%, and 12.6% mAP on the ScanNet, ScanNet++, and KITTI-360 datasets, respectively. On top of that, PointSeg can incorporate various foundation models and even surpasses specialist training-based methods by 3.4%-5.4% mAP across various datasets, serving as an effective generalist model.
Problem

Research questions and friction points this paper is trying to address.

Leverage 2D foundation models for 3D scene segmentation
Generate accurate 3D prompts without training
Improve segmentation performance across diverse datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free paradigm using vision foundation models
Two-branch prompts learning for 3D alignment
Affinity-aware merging algorithm for improved masks
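To make the last bullet concrete: one simple way to fuse overlapping per-frame masks into instance groups is to greedily merge any pair whose overlap ("affinity", here approximated by IoU) exceeds a threshold. This is only a minimal sketch under that IoU assumption; the paper's affinity-aware merging algorithm is its own, more elaborate procedure.

```python
import numpy as np

def merge_masks(masks, affinity_thresh=0.5):
    """Greedily fuse binary masks whose pairwise IoU ('affinity') exceeds a threshold."""
    merged = []
    for m in masks:
        for i, g in enumerate(merged):
            inter = np.logical_and(m, g).sum()
            union = np.logical_or(m, g).sum()
            if union and inter / union >= affinity_thresh:
                merged[i] = np.logical_or(m, g)   # fuse into the existing group
                break
        else:
            merged.append(m.copy())               # no group matched: start a new instance
    return merged
```

Each returned array is the union of all masks assigned to one instance; masks with low mutual overlap stay as separate instances.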