🤖 AI Summary
Existing 3D segmentation methods are largely confined to object-level recognition and cannot support open-vocabulary search for finer-grained semantics such as parts and materials. This work introduces Search3D, the first hierarchical open-vocabulary 3D segmentation framework, which moves beyond the object-centric paradigm to enable text-driven 3D localization at multiple granularities: objects, parts, and materials. Methodologically, the approach integrates CLIP and Point-BERT, combining multimodal feature alignment, hierarchical point cloud semantic disentanglement, and cross-granularity text–geometry matching. Contributions include: (i) the first scene-scale 3D part segmentation benchmark (built on MultiScan), together with fine-grained part annotations for ScanNet++; and (ii) support, for the first time, for attribute-level, non-object-centric text queries. Experiments show significant gains over baselines on the new part segmentation benchmark while maintaining state-of-the-art performance on object- and material-level tasks, with end-to-end text-to-3D-mask prediction.
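The cross-granularity text–geometry matching mentioned above can be thought of as ranking scene segments (objects, parts, or material regions) by similarity to a text query in a shared embedding space. The sketch below illustrates this idea with cosine similarity between a CLIP-style text embedding and per-segment features; it is a minimal illustration under assumed names and shapes, not the paper's actual implementation.

```python
import numpy as np

def match_query_to_segments(text_emb, segment_embs, top_k=3):
    """Rank 3D scene segments by cosine similarity to a text query.

    text_emb: (d,) embedding of the query, e.g. from a CLIP text encoder
    segment_embs: (n, d) one embedding per candidate segment
    Returns the indices of the top_k best-matching segments and their scores.
    """
    # Normalize both sides so the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb)
    s = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    scores = s @ t
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]
```

Because every granularity level (object, part, material region) carries its own embedding, the same query-to-segment ranking can be applied at each level of the hierarchy without assuming an object-centric query.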
📝 Abstract
Open-vocabulary 3D segmentation enables exploration of 3D spaces using free-form text descriptions. Existing methods for open-vocabulary 3D instance segmentation primarily focus on identifying object-level instances but struggle with finer-grained scene entities such as object parts or regions described by generic attributes. In this work, we introduce Search3D, an approach that constructs hierarchical open-vocabulary 3D scene representations, enabling 3D search at multiple levels of granularity: fine-grained object parts, entire objects, or regions described by attributes such as materials. Unlike prior methods, Search3D shifts towards a more flexible open-vocabulary 3D search paradigm, moving beyond explicit object-centric queries. For systematic evaluation, we further contribute a scene-scale open-vocabulary 3D part segmentation benchmark based on MultiScan, along with a set of open-vocabulary fine-grained part annotations on ScanNet++. Search3D outperforms baselines in scene-scale open-vocabulary 3D part segmentation, while maintaining strong performance in segmenting 3D objects and materials. Our project page is http://search3d-segmentation.github.io.