🤖 AI Summary
In 2D-to-3D lifting-based 3D instance segmentation, ambiguous semantic guidance and insufficient depth constraints lead to error accumulation. To address this, we propose a novel “split-then-grow” framework: lifted 2D instance masks are first purified and split using 3D geometric primitives, then grown into complete instances guided by scene structure. Key contributions include: (1) a training-free mask filtering mechanism driven by co-occurrence statistics of 3D geometric primitives, which suppresses semantically ambiguous regions; and (2) joint boundary completion and geometry-consistent growth under spatial continuity priors and high-level semantic features, enabling robust instance delineation despite semantic confusion. Crucially, our approach achieves multi-level semantic–geometric co-optimization without modifying pre-trained models. Experiments on ScanNet200, ScanNet++, and KITTI-360 demonstrate significant improvements in mAP and boundary completeness. The method exhibits strong generalization and robustness across diverse indoor and outdoor scenes.
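To make the primitive co-occurrence idea concrete, below is a minimal sketch of such a training-free filter, assuming lifted masks are boolean point sets and a superpoint-style primitive labeling of the point cloud is available. The function name, scoring rule, and threshold are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def filter_ambiguous_masks(masks, primitive_ids, purity_thresh=0.6):
    """Score each lifted mask by how strongly the geometric primitives it
    covers co-occur across all masks; drop low-scoring (ambiguous) masks.

    masks:          list of (N,) boolean arrays over the scene points
    primitive_ids:  (N,) int array assigning each point to a primitive
                    (e.g., a superpoint from planarity/normal clustering)
    purity_thresh:  illustrative cutoff, not a value from the paper
    """
    n_prims = int(primitive_ids.max()) + 1
    cooc = np.zeros((n_prims, n_prims))
    for m in masks:
        prims = np.unique(primitive_ids[m])
        cooc[np.ix_(prims, prims)] += 1  # count joint coverage of primitive pairs

    kept = []
    for m in masks:
        prims = np.unique(primitive_ids[m])
        if prims.size == 0:
            continue  # skip empty masks
        sub = cooc[np.ix_(prims, prims)]
        # Normalize pairwise co-occurrence by how often each primitive is
        # covered at all; masks straddling unrelated geometry score low.
        score = sub.mean() / (cooc[prims, prims].mean() + 1e-8)
        if score >= purity_thresh:
            kept.append(m)
    return kept
```

A mask spanning primitives that are rarely covered together by other masks gets a low score and is discarded, which is one way to operationalize "suppressing semantically ambiguous regions" without any training.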
📝 Abstract
Accurate 3D instance segmentation is crucial for high-quality scene understanding in the 3D vision domain. However, 3D instance segmentation methods based on 2D-to-3D lifting struggle to produce precise instance-level segmentation due to errors accumulated during lifting from ambiguous semantic guidance and insufficient depth constraints. To tackle these challenges, we propose splitting and growing reliable semantic masks for high-fidelity 3D instance segmentation (SGS-3D), a novel "split-then-grow" framework that first purifies and splits ambiguous lifted masks using geometric primitives, and then grows them into complete instances within the scene. Unlike existing approaches that directly rely on raw lifted masks and sacrifice segmentation accuracy, SGS-3D serves as a training-free refinement method that jointly fuses semantic and geometric information, enabling effective cooperation between the two levels of representation. Specifically, for semantic guidance, we introduce a mask filtering strategy that leverages the co-occurrence of 3D geometric primitives to identify and remove ambiguous masks, thereby ensuring more reliable semantic consistency with 3D object instances. For geometric refinement, we construct fine-grained object instances by exploiting both spatial continuity and high-level features, particularly in cases of semantic ambiguity between distinct objects. Experimental results on ScanNet200, ScanNet++, and KITTI-360 demonstrate that SGS-3D substantially improves segmentation accuracy and robustness against inaccurate masks from pre-trained models, yielding high-fidelity object instances while maintaining strong generalization across diverse indoor and outdoor environments. Code is available in the supplementary materials.
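The growing stage can be read as feature-guided region growing under a spatial-continuity constraint. Here is a minimal sketch, assuming per-point semantic features (e.g., 2D foundation-model features projected onto the point cloud) and hypothetical parameters `dist_thresh` and `sim_thresh`; this is our reading of the idea, not the paper's implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def grow_instance(points, feats, seed_mask, dist_thresh=0.05, sim_thresh=0.8):
    """Grow a purified seed mask into a full instance by alternating a
    spatial-continuity check (k-d tree neighbors) with a semantic check
    (cosine similarity to the seed's feature prototype).

    points:    (N, 3) xyz coordinates
    feats:     (N, D) L2-normalized per-point semantic features
    seed_mask: (N,) bool, one non-empty purified seed from the split stage
    """
    tree = cKDTree(points)
    instance = seed_mask.copy()
    frontier = np.flatnonzero(seed_mask)
    proto = feats[seed_mask].mean(axis=0)
    proto /= np.linalg.norm(proto) + 1e-8  # feature prototype of the seed

    while frontier.size:
        # Spatial continuity: candidates within dist_thresh of the frontier.
        nbrs = tree.query_ball_point(points[frontier], r=dist_thresh)
        cand = np.unique(np.concatenate([np.asarray(n, dtype=int) for n in nbrs]))
        cand = cand[~instance[cand]]
        # Semantic consistency: keep candidates similar to the prototype.
        accept = cand[feats[cand] @ proto >= sim_thresh]
        instance[accept] = True
        frontier = accept  # newly absorbed points seed the next expansion
    return instance
```

Freezing the prototype to the purified seed keeps growth conservative and leans on the split stage for reliability; updating it as points are absorbed is an alternative design, at the risk of drifting across semantically ambiguous boundaries between adjacent objects.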