AI Summary
Unsupervised video instance segmentation (VIS) suffers from the domain gap between synthetic and real-world videos, which hinders achieving pixel-accurate masks and temporal consistency simultaneously. To address this, we propose AutoQ-VIS, the first quality-guided self-training framework for VIS: it initializes pseudo-labels using synthetic data, employs a lightweight automatic quality assessment module to select high-confidence samples, and iteratively refines both mask accuracy and temporal coherence in an end-to-end domain adaptation pipeline. Crucially, AutoQ-VIS establishes the first fully automated, quality-aware closed-loop self-training paradigm for VIS, requiring no human annotations whatsoever. On the YouTube-VIS 2019 val set, it achieves 52.6 AP₅₀, outperforming the prior state-of-the-art unsupervised method VideoCutLER by 4.4 points and setting a new benchmark for unsupervised VIS.
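The closed loop described above (initialize on synthetic data, score pseudo-labels, retrain on the confident ones) can be sketched in a few lines. This is a hedged illustration only: the function names (`train_step`, `predict`, `quality_score`), the threshold `tau`, and the round count are illustrative stand-ins, not AutoQ-VIS's actual API or hyperparameters.

```python
def self_train(train_step, predict, quality_score,
               synthetic_data, real_videos, rounds=3, tau=0.8):
    """Quality-guided self-training loop (illustrative sketch).

    train_step(model, data) -> updated model
    predict(model, video)   -> pseudo-label (e.g. instance masks)
    quality_score(v, mask)  -> scalar confidence from the quality module
    """
    # 1. Initialize the segmenter with pseudo-labels from synthetic data.
    model = train_step(None, synthetic_data)
    for _ in range(rounds):
        # 2. Generate pseudo-labels on unlabeled real videos.
        pseudo = [(v, predict(model, v)) for v in real_videos]
        # 3. The automatic quality module keeps only high-confidence samples.
        kept = [(v, m) for v, m in pseudo if quality_score(v, m) >= tau]
        # 4. Retrain on the selected samples, closing the loop between
        #    pseudo-label generation and quality assessment.
        if kept:
            model = train_step(model, kept)
    return model
```

With toy stand-ins (a counter for the "model", a positivity check for "quality"), the loop runs end to end, which is the point of the sketch: selection and retraining alternate with no human labels anywhere.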
Abstract
Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 $\text{AP}_{50}$ on the YouTube-VIS 2019 $\texttt{val}$ set, surpassing the previous state-of-the-art VideoCutLER by 4.4 points, while requiring no human annotations. This demonstrates the viability of quality-aware self-training for unsupervised VIS. We will release the code at https://github.com/wcbup/AutoQ-VIS.