AI Summary
This work addresses the challenges of scarce manual annotations and poor performance on rare categories in open-vocabulary 3D instance segmentation. We propose OVSeg3R, a framework that transfers mature 2D open-vocabulary instance segmentation models to 3D scenes via self-supervised 3D reconstruction. Our key contributions include a view-level instance partitioning algorithm and 2D boundary-aware superpoint clustering, which jointly suppress pseudo-label noise and preserve fine-grained geometric details, thereby significantly improving segmentation accuracy for long-tail categories. The method integrates multi-view reconstruction, 2D-to-3D feature projection, open-vocabulary 2D segmentation, and geometry-constrained clustering to enable end-to-end, self-supervised 3D instance annotation generation. On ScanNet200, OVSeg3R achieves a new state-of-the-art mAP, with overall gains of +2.3 and +7.1 for novel categories, substantially narrowing the long-tail performance gap.
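The 2D-to-3D projection step can be illustrated with a minimal sketch. Assuming the reconstruction model yields, for each view, a per-pixel index into the reconstructed point cloud (the function name `project_masks_to_3d` and the array layout are illustrative, not from the paper), lifting a view's 2D instance mask onto the 3D points is a scatter operation:

```python
import numpy as np

def project_masks_to_3d(mask_2d, point_idx, num_points, ignore=-1):
    """Lift one view's 2D instance mask onto reconstructed 3D points.

    mask_2d:   (H, W) int array of per-pixel instance ids (ignore = background)
    point_idx: (H, W) int array mapping each pixel to a 3D point index,
               as given by the reconstruction's 2D-3D correspondences
               (ignore = pixel with no valid 3D point)
    Returns a (num_points,) array of instance ids (ignore = unlabeled),
    i.e. pseudo-annotations for this view's sub-scene only.
    """
    labels_3d = np.full(num_points, ignore, dtype=np.int64)
    valid = (point_idx != ignore) & (mask_2d != ignore)
    # Scatter each labeled pixel's instance id to its 3D point.
    labels_3d[point_idx[valid]] = mask_2d[valid]
    return labels_3d
```

Note that points outside the view's frustum stay unlabeled, which is exactly why supervision must be restricted to each view's sub-scene rather than treating unlabeled points as negatives.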
Abstract
In this paper, we propose a training scheme called OVSeg3R to learn open-vocabulary 3D instance segmentation from well-studied 2D perception models with the aid of 3D reconstruction. OVSeg3R directly adopts scenes reconstructed from 2D videos as input, avoiding costly manual adjustment while aligning the input with real-world applications. By exploiting the 2D-to-3D correspondences provided by 3D reconstruction models, OVSeg3R projects each view's 2D instance mask predictions, obtained from an open-vocabulary 2D model, onto 3D to generate annotations for the view's corresponding sub-scene. Since the annotations projected from each view cover only part of the scene, naively supervising over the full scene would treat unannotated objects as false positives. To avoid this, we propose a View-wise Instance Partition algorithm, which partitions predictions to their respective views for supervision, stabilizing the training process. Furthermore, since 3D reconstruction models tend to over-smooth geometric details, clustering reconstructed points into representative superpoints based solely on geometry, as commonly done in mainstream 3D segmentation methods, may overlook geometrically non-salient objects. We therefore introduce 2D Instance Boundary-aware Superpoint, which leverages 2D masks to constrain superpoint clustering, preventing superpoints from violating instance boundaries. With these designs, OVSeg3R not only extends a state-of-the-art closed-vocabulary 3D instance segmentation model to the open-vocabulary setting, but also substantially narrows the performance gap between tail and head classes, ultimately yielding an overall improvement of +2.3 mAP on the ScanNet200 benchmark. Moreover, under the standard open-vocabulary setting, OVSeg3R surpasses previous methods by about +7.1 mAP on novel classes, further validating its effectiveness.
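The boundary constraint on superpoint clustering can be sketched in a toy form. This is not the paper's algorithm (mainstream superpoint methods typically use graph-based segmentation over geometric features); it only illustrates the stated rule that a merge between neighboring points is vetoed when their projected 2D instance labels disagree. All names (`UnionFind`, `boundary_aware_superpoints`, `cos_thresh`) are illustrative:

```python
import numpy as np

class UnionFind:
    """Minimal disjoint-set structure with path halving."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def boundary_aware_superpoints(edges, normals, labels_2d, cos_thresh=0.95):
    """Cluster points into superpoints without crossing 2D instance boundaries.

    edges:     iterable of (i, j) geometric-neighbor pairs
    normals:   (N, 3) unit point normals
    labels_2d: (N,) instance ids projected from 2D masks (-1 = unlabeled)
    A merge requires near-parallel normals (geometric smoothness) AND
    non-conflicting projected 2D instance labels (boundary constraint).
    Returns a (N,) array of superpoint root ids.
    """
    uf = UnionFind(len(labels_2d))
    for i, j in edges:
        geom_ok = float(np.dot(normals[i], normals[j])) > cos_thresh
        label_ok = (labels_2d[i] == -1 or labels_2d[j] == -1
                    or labels_2d[i] == labels_2d[j])
        if geom_ok and label_ok:
            uf.union(i, j)
    return np.array([uf.find(k) for k in range(len(labels_2d))])
```

The label check is what keeps two coplanar but distinct objects (e.g. a thin book lying flat on a table) in separate superpoints even when geometry alone cannot separate them.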