Class-agnostic 3D Segmentation by Granularity-Consistent Automatic 2D Mask Tracking

📅 2025-11-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods project foundation-model-derived 2D masks independently into 3D space to generate pseudo-labels, neglecting inter-frame consistency; this leads to temporal conflicts and inconsistent granularity, which severely degrade 3D instance segmentation accuracy. To address this, we propose a fully automatic, annotation-free, end-to-end framework. First, we design a granularity-consistent, class-agnostic 2D mask tracking mechanism that ensures stable cross-frame propagation via explicit frame-to-frame correspondences. Second, we introduce a three-stage curriculum learning paradigm that progressively fuses fragmented single-view predictions to distill globally consistent, scene-level 3D supervision signals. Our method establishes the first fully automated pipeline from raw video input to high-accuracy, temporally coherent, open-vocabulary-compatible 3D instance segmentation. Evaluated on mainstream benchmarks, it achieves state-of-the-art performance, significantly improving both segmentation accuracy and structural consistency.
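The core tracking idea, propagating mask identities across frames via explicit frame-to-frame correspondences, can be illustrated with a minimal, hypothetical sketch. The paper's actual mechanism is not detailed in this summary; the code below uses simple greedy IoU matching between consecutive frames as a stand-in, and all function names (`mask_iou`, `track_masks`) and the threshold are illustrative assumptions:

```python
# Hypothetical sketch: class-agnostic 2D mask tracking via greedy IoU matching.
# Masks in consecutive frames that overlap sufficiently inherit the same track
# ID, so their 3D projections no longer produce conflicting pseudo-labels.
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def track_masks(frames, iou_thresh=0.5):
    """frames: list (per frame) of lists of boolean masks.
    Returns, per frame, a list of track IDs aligned with the input masks.
    Unmatched masks start new tracks."""
    next_id = 0
    prev_masks, prev_ids = [], []
    all_ids = []
    for masks in frames:
        ids = [-1] * len(masks)
        used_prev = set()
        # Score every (current, previous) pair and match greedily, best first.
        pairs = sorted(
            ((mask_iou(m, pm), i, j)
             for i, m in enumerate(masks)
             for j, pm in enumerate(prev_masks)),
            reverse=True)
        for iou, i, j in pairs:
            if iou < iou_thresh:
                break
            if ids[i] == -1 and j not in used_prev:
                ids[i] = prev_ids[j]   # propagate identity across frames
                used_prev.add(j)
        for i in range(len(masks)):
            if ids[i] == -1:           # no correspondence: new object track
                ids[i] = next_id
                next_id += 1
        all_ids.append(ids)
        prev_masks, prev_ids = masks, ids
    return all_ids
```

A real system would also need to handle granularity merges/splits (one mask covering several previous tracks), which this greedy sketch deliberately omits.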

📝 Abstract
3D instance segmentation is an important task for real-world applications. To avoid costly manual annotations, existing methods have explored generating pseudo labels by transferring 2D masks from foundation models to 3D. However, this approach is often suboptimal since the video frames are processed independently. This causes inconsistent segmentation granularity and conflicting 3D pseudo labels, which degrade the accuracy of the final segmentation. To address this, we introduce a Granularity-Consistent automatic 2D Mask Tracking approach that maintains temporal correspondences across frames, eliminating conflicting pseudo labels. Combined with a three-stage curriculum learning framework, our approach progressively trains from fragmented single-view data to unified multi-view annotations, and ultimately to globally coherent full-scene supervision. This structured learning pipeline progressively exposes the model to pseudo-labels of increasing consistency. Thus, we can robustly distill a consistent 3D representation from initially fragmented and contradictory 2D priors. Experimental results demonstrated that our method effectively generated consistent and accurate 3D segmentations. Furthermore, the proposed method achieved state-of-the-art results on standard benchmarks along with open-vocabulary capability.
Problem

Research questions and friction points this paper is trying to address.

Addressing inconsistent granularity in 3D segmentation from 2D masks
Eliminating conflicting pseudo labels via temporal correspondence tracking
Progressively learning coherent 3D representations from fragmented 2D priors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Granularity-Consistent automatic 2D Mask Tracking across frames
Three-stage curriculum learning for progressive training
Distilling consistent 3D representation from fragmented 2D priors
Juan Wang
Graduate School of Engineering, Tohoku University, Sendai, Japan
Yasutomo Kawanishi
Team Director, Riken Guardian Robot Project
Computer Vision, Pattern Recognition, Multimedia, Robot Vision
Tomo Miyazaki
Graduate School of Engineering, Tohoku University, Sendai, Japan
Zhijie Wang
Multimodal Visual Intelligence Team, RIKEN AIP, Sendai, Japan
Shinichiro Omachi
Professor of Engineering, Tohoku University
pattern recognition, image processing, machine learning