🤖 AI Summary
Video instance segmentation (VIS) faces significant challenges in unsupervised settings, including large domain gaps between synthetic and real-world data, and reliance on optical flow or manual annotations. To address these issues, we propose a quality-guided closed-loop self-training framework that enables unsupervised domain adaptation from synthetic to real-world videos. Our method introduces automated pseudo-label quality assessment and a progressive filtering mechanism, establishing a self-iterative “evaluate–filter–train” loop. Crucially, it eliminates the need for optical flow estimation and human annotations entirely. Evaluated on the YouTube-VIS 2019 validation set, our approach achieves 52.6 AP₅₀—surpassing the prior state-of-the-art unsupervised method VideoCutLER by 4.4 points. To the best of our knowledge, this is the first work to significantly narrow the synthetic-to-real domain gap in VIS under a zero-shot, annotation-free setting.
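The "evaluate–filter–train" loop can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: `assess_quality` is a hypothetical stand-in for the automated pseudo-label quality assessor, training is stubbed out, and the progressive filter is modeled as a threshold that relaxes each round so later rounds admit more real-video pseudo-labels.

```python
def assess_quality(pseudo_label):
    # Hypothetical stand-in for the automated quality assessor;
    # here we simply read a precomputed score attached to the label.
    return pseudo_label["score"]

def evaluate_filter_train(pseudo_labels, rounds=3, start_thresh=0.9, step=0.1):
    """Minimal sketch of a quality-guided 'evaluate-filter-train' loop.

    Each round: score every pseudo-label (evaluate), keep those above the
    current threshold (filter), update the model on the kept set (train,
    stubbed here), then relax the threshold (progressive filtering).
    """
    kept_per_round = []
    thresh = start_thresh
    for _ in range(rounds):
        scored = [(p, assess_quality(p)) for p in pseudo_labels]  # evaluate
        kept = [p for p, s in scored if s >= thresh]              # filter
        kept_per_round.append(len(kept))                          # train (stub)
        thresh = round(thresh - step, 2)                          # relax threshold
    return kept_per_round

# Toy pseudo-labels with scores 0.0, 0.1, ..., 1.0
labels = [{"id": i, "score": i / 10} for i in range(11)]
print(evaluate_filter_train(labels))  # → [2, 3, 4]
```

Each round the relaxed threshold admits more pseudo-labels, mirroring the progressive adaptation from high-confidence synthetic-like samples toward the broader real-video distribution.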
📝 Abstract
Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 AP₅₀ on the YouTube-VIS 2019 validation set, surpassing the previous state-of-the-art VideoCutLER by 4.4%, while requiring no human annotations. This demonstrates the viability of quality-aware self-training for unsupervised VIS. The source code of our method is available at https://github.com/wcbup/AutoQ-VIS.