Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training

📅 2025-12-07
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Unsupervised video instance segmentation (VIS) suffers from the domain gap between synthetic and real-world videos, which hinders the simultaneous achievement of pixel-accurate masks and temporal consistency. To address this, we propose AutoQ-VIS, the first quality-guided self-training framework for VIS: it initializes pseudo-labels from synthetic data, employs a lightweight automatic quality-assessment module to select high-confidence samples, and iteratively refines both mask accuracy and temporal coherence in an end-to-end domain-adaptation pipeline. Crucially, AutoQ-VIS establishes the first fully automated, quality-aware closed-loop self-training paradigm for VIS, requiring no human annotations whatsoever. On the YouTube-VIS 2019 val set it achieves 52.6 AP₅₀, outperforming the prior state-of-the-art unsupervised method VideoCutLER by 4.4 points and setting a new benchmark for unsupervised VIS.

๐Ÿ“ Abstract
Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 $\text{AP}_{50}$ on the YouTube-VIS 2019 $\texttt{val}$ set, surpassing the previous state-of-the-art VideoCutLER by 4.4%, while requiring no human annotations. This demonstrates the viability of quality-aware self-training for unsupervised VIS. We will release the code at https://github.com/wcbup/AutoQ-VIS.
Problem

Research questions and friction points this paper is trying to address.

Unsupervised video instance segmentation faces steep annotation demands: pixel-level masks plus temporal consistency labels
Existing synthetic-data methods such as VideoCutLER remain limited by the synthetic-to-real domain gap
The paper proposes a quality-guided self-training framework that requires no human annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quality-guided self-training for unsupervised video segmentation
Closed-loop system with automatic pseudo-label quality assessment
Progressive adaptation from synthetic to real videos
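The closed loop above amounts to: generate pseudo-labels on real videos, score each label's quality automatically, keep only high-confidence samples, and retrain. A minimal sketch of the selection step is below; the IoU-based temporal-consistency score, the threshold value, and all function names are illustrative assumptions, not the paper's actual quality-assessment module.

```python
def mask_iou(mask_a, mask_b):
    """IoU between two binary masks, represented as sets of pixel coordinates."""
    union = mask_a | mask_b
    if not union:
        return 0.0
    return len(mask_a & mask_b) / len(union)

def temporal_quality(track):
    """Mean IoU between masks of the same instance in adjacent frames.
    A simple proxy for temporal consistency (assumed metric, not the paper's)."""
    pairs = list(zip(track, track[1:]))
    if not pairs:
        return 0.0
    return sum(mask_iou(a, b) for a, b in pairs) / len(pairs)

def select_pseudo_labels(tracks, threshold=0.7):
    """Keep only instance tracks whose quality score passes the
    (illustrative) threshold; the survivors feed the next training round."""
    return [t for t in tracks if temporal_quality(t) >= threshold]

# Toy usage: a stable track passes, a jittery one is filtered out.
stable = [{(0, 0), (0, 1)}, {(0, 0), (0, 1)}, {(0, 0), (0, 1)}]
jittery = [{(0, 0)}, {(5, 5)}]
kept = select_pseudo_labels([stable, jittery])
```

In the full pipeline this filter would sit inside an iteration loop: each round, the current model produces tracks on unlabeled real videos, the filter selects the confident subset, and the model is retrained on that subset, progressively narrowing the synthetic-to-real gap.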