🤖 AI Summary
This work addresses unsupervised video instance segmentation (UVIS). We propose a three-stage self-supervised framework: (1) fusing image and optical-flow features to construct a pixel-wise affinity graph for generating initial pseudo-instance masks; (2) enforcing temporal consistency across frames via optical-flow-guided dynamic mask matching to form high-quality short video clips; and (3) performing pseudo-label distillation and end-to-end model training on YouTubeVIS-2021. To our knowledge, this yields the first large-scale pseudo-label video dataset specifically designed for UVIS. Our key innovations are optical-flow-guided feature affinity modeling and temporal cross-frame mask matching. Extensive experiments demonstrate state-of-the-art performance on the YouTubeVIS-2019/2021, DAVIS-2017, and DAVIS-2017 Motion benchmarks.
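The first stage above can be illustrated with a minimal spectral-cut sketch: build an affinity graph over fused image and flow features, then bipartition it via the second eigenvector of the normalized graph Laplacian. This is a generic normalized-cut-style toy, not the paper's exact procedure; the concatenation fusion, the threshold `tau`, and the mean-split of the eigenvector are all our assumptions.

```python
import numpy as np

def affinity_bipartition(img_feats, flow_feats, tau=0.2):
    """Bipartition pixels with a spectral cut over a fused-feature affinity graph.

    img_feats, flow_feats: (N, D) per-pixel features. Fusion here is a plain
    concatenation (an assumption; the paper's fusion may differ).
    Returns a boolean foreground mask over the N pixels.
    """
    feats = np.concatenate([img_feats, flow_feats], axis=1)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    A = feats @ feats.T                            # cosine affinities
    A = np.where(A > tau, 1.0, 1e-5)               # sparsify, keep graph connected
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(d)) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian
    # second-smallest eigenvector (Fiedler vector) gives the bipartition
    vals, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]
    return fiedler > fiedler.mean()
```

On two well-separated feature clusters, the returned mask splits the pixels into the two groups; connected components of such a cut would then serve as candidate pseudo-instance masks.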
📝 Abstract
We propose FlowCut, a simple and capable method for unsupervised video instance segmentation, built on a three-stage framework that constructs a high-quality video dataset with pseudo labels. To our knowledge, our work is the first attempt to curate a pseudo-labeled video dataset for unsupervised video instance segmentation. In the first stage, we generate pseudo-instance masks by exploiting the affinities of features from both images and optical flows. In the second stage, we construct short video segments containing high-quality, consistent pseudo-instance masks by temporally matching them across frames. In the third stage, we extract our training instance-segmentation set from the YouTubeVIS-2021 videos and train a video segmentation model on it. FlowCut achieves state-of-the-art performance on the YouTubeVIS-2019, YouTubeVIS-2021, DAVIS-2017, and DAVIS-2017 Motion benchmarks.
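The second stage, temporal matching of pseudo-instance masks, can be sketched as follows: warp each mask to the next frame with optical flow, then greedily associate warped masks with next-frame candidates by IoU. The forward-splatting warp, the greedy assignment, and the `thresh` value are illustrative assumptions, not the paper's exact matching rule.

```python
import numpy as np

def warp_mask(mask, flow):
    """Warp a binary mask to the next frame via forward optical flow.

    mask: (H, W) bool array; flow: (H, W, 2) per-pixel (dx, dy) displacements.
    A simple nearest-pixel forward splat (hypothetical helper).
    """
    H, W = mask.shape
    warped = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)
    new_xs = np.clip(np.round(xs + flow[ys, xs, 0]).astype(int), 0, W - 1)
    new_ys = np.clip(np.round(ys + flow[ys, xs, 1]).astype(int), 0, H - 1)
    warped[new_ys, new_xs] = True
    return warped

def iou(a, b):
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def match_masks(prev_masks, flow, next_masks, thresh=0.5):
    """Greedily link each warped previous-frame mask to its best next-frame mask."""
    matches = []
    for i, m in enumerate(prev_masks):
        warped = warp_mask(m, flow)
        scores = [iou(warped, n) for n in next_masks]
        if scores and max(scores) >= thresh:
            matches.append((i, int(np.argmax(scores))))
    return matches
```

Chaining such matches across consecutive frames yields the short, temporally consistent pseudo-labeled clips that the third stage distills into the final video segmentation model.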