MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation

📅 2026-04-10

🤖 AI Summary

Existing zero-shot 3D instance segmentation methods often yield fragmented results because they neglect multi-view consistency and 3D geometric priors. This work proposes a coarse-to-fine zero-shot segmentation framework that uses coarse 3D segments as a shared reference for cross-view matching, so that the 2D masks produced by SAM in different views are aligned under 3D guidance. To address occlusion ambiguities, it introduces a depth-consistency weighting mechanism that scores the reliability of each 3D-to-2D projection, which improves instance completeness. The method outperforms previous approaches on ScanNetV2, ScanNet200, ScanNet++, Replica, and Matterport3D, delivering more complete and robust zero-shot 3D instance segmentation.
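The depth-consistency idea can be illustrated with a minimal sketch. It is not the paper's formulation: the exponential weighting, the `tau` tolerance, and the function name `depth_consistency_weights` are assumptions chosen only for illustration. Each 3D point (already in the camera frame) is projected with a pinhole model, and its projected depth is compared with the depth map at that pixel; a large disagreement suggests the point is occluded by another object, so its projection receives a low weight.

```python
import numpy as np

def depth_consistency_weights(points_cam, depth_map, intrinsics, tau=0.05):
    """Hypothetical reliability weight in [0, 1] for each projected 3D point.

    points_cam: (N, 3) points in the camera frame.
    depth_map:  (H, W) observed depth in the same units (0 means missing).
    intrinsics: (3, 3) pinhole camera matrix.
    tau:        assumed tolerance controlling how fast the weight decays.
    """
    height, width = depth_map.shape
    z = points_cam[:, 2]
    weights = np.zeros(len(points_cam))

    # Only points in front of the camera can be projected.
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)  # avoid division by zero

    # Pinhole projection to integer pixel coordinates.
    u = np.round(intrinsics[0, 0] * points_cam[:, 0] / safe_z + intrinsics[0, 2]).astype(int)
    v = np.round(intrinsics[1, 1] * points_cam[:, 1] / safe_z + intrinsics[1, 2]).astype(int)
    valid = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)

    # Compare projected depth with the observed depth at the same pixel.
    observed = depth_map[v[valid], u[valid]]
    has_depth = observed > 0
    idx = np.flatnonzero(valid)[has_depth]
    weights[idx] = np.exp(-np.abs(z[idx] - observed[has_depth]) / tau)
    return weights
```

Points that fall outside the image, lie behind the camera, or land on a pixel with no depth reading keep a weight of zero in this sketch, so they contribute nothing to any later 3D-to-2D vote.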

📝 Abstract

Conventional 3D instance segmentation methods rely on labor-intensive 3D annotations for supervised training, which limits their scalability and generalization to novel objects. Recent approaches leverage multi-view 2D masks from the Segment Anything Model (SAM) to guide the merging of 3D geometric primitives, thereby enabling zero-shot 3D instance segmentation. However, these methods typically process each frame independently and rely solely on 2D metrics, such as SAM prediction scores, to produce segmentation maps. This design overlooks multi-view correlations and inherent 3D priors, leading to inconsistent 2D masks across views and ultimately fragmented 3D segmentation. In this paper, we propose MV3DIS, a coarse-to-fine framework for zero-shot 3D instance segmentation that explicitly incorporates 3D priors. Specifically, we introduce a 3D-guided mask matching strategy that uses coarse 3D segments as a common reference to match 2D masks across views and consolidates multi-view mask consistency via 3D coverage distributions. Guided by these view-consistent 2D masks, the coarse 3D segments are further refined into precise 3D instances. Additionally, we introduce a depth consistency weighting scheme that quantifies projection reliability to suppress ambiguities from inter-object occlusions, thereby improving the robustness of 3D-to-2D correspondence. Extensive experiments on the ScanNetV2, ScanNet200, ScanNet++, Replica, and Matterport3D datasets demonstrate the effectiveness of MV3DIS, which achieves superior performance over previous methods.
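One way to picture matching 2D masks through a shared 3D reference is sketched below. This is an assumption-laden illustration, not the method described in the paper: the abstract does not define how coverage distributions are computed or compared, so the per-mask histogram over coarse segments, the cosine similarity, the `min_similarity` threshold, and the function names are all hypothetical. The idea is that two masks from different views probably show the same object if the 3D points projecting into them fall on the same coarse segments.

```python
import numpy as np

def coverage_distribution(mask_ids, segment_ids, num_masks, num_segments):
    """Per-mask distribution over coarse 3D segments (hypothetical definition).

    mask_ids:    (N,) index of the 2D mask each visible point projects into, -1 for none.
    segment_ids: (N,) index of the coarse 3D segment each point belongs to.
    Returns a (num_masks, num_segments) array whose non-empty rows sum to 1.
    """
    counts = np.zeros((num_masks, num_segments))
    keep = mask_ids >= 0
    # Accumulate one vote per point into its (mask, segment) cell.
    np.add.at(counts, (mask_ids[keep], segment_ids[keep]), 1.0)
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

def match_masks(cov_a, cov_b, min_similarity=0.5):
    """Pair each mask in view A with its most similar mask in view B.

    Similarity is the cosine between coverage distributions (an assumed choice).
    Returns a list of (index_in_a, index_in_b) pairs above the threshold.
    """
    if cov_a.shape[0] == 0 or cov_b.shape[0] == 0:
        return []
    norm_a = np.linalg.norm(cov_a, axis=1, keepdims=True)
    norm_b = np.linalg.norm(cov_b, axis=1, keepdims=True)
    unit_a = np.divide(cov_a, norm_a, out=np.zeros_like(cov_a), where=norm_a > 0)
    unit_b = np.divide(cov_b, norm_b, out=np.zeros_like(cov_b), where=norm_b > 0)
    similarity = unit_a @ unit_b.T

    matches = []
    for i in range(similarity.shape[0]):
        j = int(np.argmax(similarity[i]))
        if similarity[i, j] >= min_similarity:
            matches.append((i, j))
    return matches
```

Because both views are described in terms of the same coarse segments, the comparison never needs pixel overlap between frames, which is what makes a 3D reference useful when the views share little image content. The votes in `coverage_distribution` are unweighted here; a per-point reliability weight could replace the constant `1.0`.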

Problem

Research questions and friction points this paper is trying to address.

zero-shot 3D instance segmentation

multi-view mask matching

3D priors

mask consistency

3D-to-2D correspondence

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D-guided mask matching

zero-shot 3D instance segmentation

multi-view consistency