AI Summary
Unsupervised video instance segmentation (VIS) suffers from the domain gap between synthetic and real-world videos, which hinders achieving pixel-accurate masks and temporal consistency simultaneously. To address this, we propose AutoQ-VIS, the first quality-guided self-training framework for VIS: it initializes pseudo-labels using synthetic data, employs a lightweight automatic quality assessment module to select high-confidence samples, and iteratively refines both mask accuracy and temporal coherence in an end-to-end domain adaptation pipeline. Crucially, AutoQ-VIS establishes the first fully automated, quality-aware closed-loop self-training paradigm for VIS, requiring no human annotations whatsoever. On the YouTube-VIS 2019 val set, it achieves 52.6 AP₅₀, outperforming the prior state-of-the-art unsupervised method VideoCutLER by 4.4 points and setting a new benchmark for unsupervised VIS.
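The closed loop described above (initialize on synthetic data, score pseudo-labels, retrain on the confident ones) can be sketched in a few lines. This is a hedged illustration only: the function names (`train_step`, `predict`, `quality_score`), the threshold `tau`, and the round count are illustrative stand-ins, not AutoQ-VIS's actual API or hyperparameters.

```python
def self_train(train_step, predict, quality_score,
               synthetic_data, real_videos, rounds=3, tau=0.8):
    """Quality-guided self-training loop (illustrative sketch).

    train_step(model, data) -> updated model
    predict(model, video)   -> pseudo-label (e.g. instance masks)
    quality_score(v, mask)  -> scalar confidence from the quality module
    """
    # 1. Initialize the segmenter with pseudo-labels from synthetic data.
    model = train_step(None, synthetic_data)
    for _ in range(rounds):
        # 2. Generate pseudo-labels on unlabeled real videos.
        pseudo = [(v, predict(model, v)) for v in real_videos]
        # 3. The automatic quality module keeps only high-confidence samples.
        kept = [(v, m) for v, m in pseudo if quality_score(v, m) >= tau]
        # 4. Retrain on the selected samples, closing the loop between
        #    pseudo-label generation and quality assessment.
        if kept:
            model = train_step(model, kept)
    return model
```

With toy stand-ins (a counter for the "model", a positivity check for "quality"), the loop runs end to end, which is the point of the sketch: selection and retraining alternate with no human labels anywhere.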
Abstract
Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 $\text{AP}_{50}$ on the YouTube-VIS 2019 $\texttt{val}$ set, surpassing the previous state-of-the-art VideoCutLER by 4.4 points, while requiring no human annotations. This demonstrates the viability of quality-aware self-training for unsupervised VIS. We will release the code at https://github.com/wcbup/AutoQ-VIS.