SNAP: Towards Segmenting Anything in Any Point Cloud

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing interactive 3D point cloud segmentation methods are restricted to a single scene domain (e.g., indoor-only or outdoor-only) and a single interaction modality (point prompts only or text prompts only); moreover, joint training on multiple datasets often induces negative transfer, severely limiting generalization. This paper proposes the first universal point cloud segmentation framework supporting dual-modal interaction through both point and text prompts. To mitigate negative transfer, it introduces domain-adaptive normalization. It further matches automatically generated mask proposals against CLIP text embeddings, enabling open-vocabulary understanding and panoptic segmentation. The model is jointly trained on seven cross-domain datasets and achieves state-of-the-art performance on eight of nine zero-shot spatial-prompting benchmarks, while remaining competitive on all five text-prompting benchmarks. The approach significantly improves cross-domain and cross-modal generalization as well as practical usability.

📝 Abstract
Interactive 3D point cloud segmentation enables efficient annotation of complex 3D scenes through user-guided prompts. However, current approaches are typically restricted in scope to a single domain (indoor or outdoor) and to a single form of user interaction (either spatial clicks or textual prompts). Moreover, training on multiple datasets often leads to negative transfer, resulting in domain-specific tools that lack generalizability. To address these limitations, we present SNAP (Segment aNything in Any Point cloud), a unified model for interactive 3D segmentation that supports both point-based and text-based prompts across diverse domains. Our approach achieves cross-domain generalizability by training on 7 datasets spanning indoor, outdoor, and aerial environments, while employing domain-adaptive normalization to prevent negative transfer. For text-prompted segmentation, we automatically generate mask proposals without human intervention and match them against CLIP embeddings of textual queries, enabling both panoptic and open-vocabulary segmentation. Extensive experiments demonstrate that SNAP consistently delivers high-quality segmentation results. We achieve state-of-the-art performance on 8 out of 9 zero-shot benchmarks for spatial-prompted segmentation and demonstrate competitive results on all 5 text-prompted benchmarks. These results show that a unified model can match or exceed specialized domain-specific approaches, providing a practical tool for scalable 3D annotation. Project page: https://neu-vi.github.io/SNAP/
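The abstract's key anti-negative-transfer idea, domain-adaptive normalization, can be sketched as keeping separate normalization parameters per source domain so that feature statistics from, say, indoor RGB-D scans and outdoor LiDAR sweeps do not get averaged together. The class below is a minimal illustrative sketch, not the paper's actual layer; the class name, parameter layout, and per-scene statistics are assumptions.

```python
import numpy as np

class DomainAdaptiveNorm:
    """Hypothetical sketch of domain-adaptive normalization: shared
    normalization logic, but a separate learnable scale (gamma) and
    shift (beta) per domain. The paper's exact formulation may differ."""

    def __init__(self, num_features, domains, eps=1e-5):
        self.eps = eps
        # One affine parameter set per domain (e.g., indoor / outdoor / aerial).
        self.params = {
            d: {"gamma": np.ones(num_features), "beta": np.zeros(num_features)}
            for d in domains
        }

    def __call__(self, x, domain):
        # x: (num_points, num_features) point features for one scene.
        mean = x.mean(axis=0, keepdims=True)
        var = x.var(axis=0, keepdims=True)
        x_hat = (x - mean) / np.sqrt(var + self.eps)  # standardize per feature
        p = self.params[domain]
        return x_hat * p["gamma"] + p["beta"]  # domain-specific affine transform
```

With default parameters each domain's output is standardized independently, so scale differences between domains never mix during training.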
Problem

Research questions and friction points this paper is trying to address.

Unifying point-based and text-based prompts for 3D segmentation
Overcoming negative transfer across multiple datasets and domains
Enabling cross-domain generalizability for interactive point cloud segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified model supports point and text prompts
Domain-adaptive normalization prevents negative transfer
Automated mask proposals match CLIP text embeddings
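The third innovation, matching automatically generated mask proposals to CLIP text embeddings, amounts to a nearest-neighbor lookup in the shared embedding space. A minimal sketch, assuming mask and text embeddings are already produced by some encoder (the function name and toy vectors below are illustrative, not the paper's code):

```python
import numpy as np

def match_masks_to_text(mask_embeds, text_embeds, labels):
    """Assign each mask proposal the text label whose embedding is most
    similar under cosine similarity.

    mask_embeds: (M, D) embeddings of automatically generated mask proposals.
    text_embeds: (K, D) CLIP-style embeddings of the K text queries.
    labels:      list of K label strings.
    Returns the per-mask label assignment and the (M, K) similarity matrix.
    """
    # L2-normalize so the dot product equals cosine similarity.
    m = mask_embeds / np.linalg.norm(mask_embeds, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = m @ t.T                 # (M, K) cosine similarities
    best = sims.argmax(axis=1)     # nearest text query per mask
    return [labels[i] for i in best], sims
```

Because the label set is just a list of free-form strings embedded at query time, this matching step is what makes open-vocabulary segmentation possible without retraining.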