SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of existing open-vocabulary 3D instance segmentation methods: multi-stage 2D+3D pipelines are slow at inference, and pseudo-labeled end-to-end approaches depend on fragmented masks and external region proposals. The authors propose SpaCeFormer, a proposal-free space-curve transformer that combines spatial window attention with Morton-curve serialization and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries. They also introduce SpaCeFormer-3M, described as the largest dataset for this task: 3.0M multi-view-consistent captions over 604K instances from 7.4K scenes, built with multi-view mask clustering and multi-view VLM captioning. The dataset pipeline reaches 54.3% mask recall at IoU > 0.5, versus 2.5% for prior single-view pipelines. On ScanNet200, SpaCeFormer achieves a zero-shot mAP of 11.1, a 2.8x improvement over the best prior proposal-free method, and it reports mAP scores of 22.9 and 24.1 on ScanNet++ and Replica, respectively. Inference takes 0.14 seconds per scene.
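To make the mask recall metric concrete, here is a minimal sketch of recall at an IoU threshold over boolean per-point masks. The function names (`mask_iou`, `mask_recall`) and the rule that a ground-truth instance counts as recalled when any predicted mask overlaps it above the threshold are assumptions for illustration; the paper's exact evaluation protocol is not described on this page.

```python
import numpy as np


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean per-point masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def mask_recall(gt_masks, pred_masks, iou_thresh: float = 0.5) -> float:
    """Fraction of ground-truth masks matched by some prediction.

    Assumption: a ground-truth mask is "recalled" if at least one
    predicted mask has IoU strictly greater than iou_thresh with it.
    """
    if len(gt_masks) == 0:
        return 0.0
    hits = 0
    for gt in gt_masks:
        best = max((mask_iou(gt, p) for p in pred_masks), default=0.0)
        if best > iou_thresh:
            hits += 1
    return hits / len(gt_masks)


# Toy scene with 6 points and two ground-truth instances.
gt = [np.array([1, 1, 1, 0, 0, 0], dtype=bool),
      np.array([0, 0, 0, 1, 1, 1], dtype=bool)]
pred = [np.array([1, 1, 1, 1, 0, 0], dtype=bool),   # IoU 0.75 with gt[0]
        np.array([0, 0, 0, 0, 0, 1], dtype=bool)]   # IoU 1/3 with gt[1]
recall = mask_recall(gt, pred)
```

In this toy scene only the first ground-truth instance is covered above the threshold, so half of the instances are recalled.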

📝 Abstract

Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs at 0.14 seconds per scene, 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines. We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21x higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU > 0.5). SpaCeFormer combines spatial window attention with Morton-curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8x improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.
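As a rough illustration of Morton-curve serialization, the sketch below quantizes 3D points to a grid, interleaves the bits of each cell's x, y, and z indices into a single Morton (Z-order) code, and sorts points by that code so that neighbors in the 1D sequence tend to be neighbors in space. The helper names (`morton_codes`, `serialize`), the 10-bit grid resolution, and the normalization by the largest scene extent are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def morton_codes(cells: np.ndarray, bits: int = 10) -> np.ndarray:
    """Interleave the low `bits` bits of integer (x, y, z) cell indices.

    Bit b of axis k lands at position 3*b + k of the output code.
    """
    cells = cells.astype(np.uint64)
    codes = np.zeros(len(cells), dtype=np.uint64)
    for b in range(bits):
        for axis in range(3):
            bit = (cells[:, axis] >> np.uint64(b)) & np.uint64(1)
            codes |= bit << np.uint64(3 * b + axis)
    return codes


def serialize(points: np.ndarray, bits: int = 10) -> np.ndarray:
    """Return the permutation that orders points along the Morton curve."""
    mins = points.min(axis=0)
    # Normalize by the largest extent so cells stay cubic (assumption).
    extent = max(float((points.max(axis=0) - mins).max()), 1e-9)
    scale = 2 ** bits - 1
    cells = np.floor((points - mins) / extent * scale).astype(np.int64)
    return np.argsort(morton_codes(cells, bits), kind="stable")


# Usage: two nearby points end up adjacent in the serialized order.
pts = np.array([[0.9, 0.9, 0.9],
                [0.0, 0.0, 0.0],
                [0.1, 0.0, 0.0]])
order = serialize(pts)
```

Contiguous chunks of the sorted sequence could then serve as local attention windows; whether SpaCeFormer groups tokens exactly this way is not stated in the abstract.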

Problem

Research questions and friction points this paper is trying to address.

open-vocabulary

3D instance segmentation

proposal-free

efficiency bottleneck

spatial coherence

Innovation

Methods, ideas, or system contributions that make the work stand out.

proposal-free

open-vocabulary 3D instance segmentation

space-curve transformer