🤖 AI Summary
This work addresses unsupervised video instance segmentation (UVIS). We propose a three-stage self-supervised framework: (1) fusing image and optical-flow features to construct a pixel-wise affinity graph for generating initial pseudo-instance masks; (2) enforcing temporal consistency across frames via optical-flow-guided dynamic mask matching to form high-quality short video clips; and (3) performing pseudo-label distillation and end-to-end model training on YouTubeVIS-2021. To our knowledge, this yields the first large-scale pseudo-label video dataset specifically designed for UVIS. Our key innovations are optical-flow-guided feature affinity modeling and temporal cross-frame mask matching. Extensive experiments demonstrate state-of-the-art performance on the YouTubeVIS-2019/2021, DAVIS-2017, and DAVIS-2017 Motion benchmarks.
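The first stage above can be illustrated with a minimal spectral-cut sketch: build an affinity graph over fused image and flow features, then bipartition it via the second eigenvector of the normalized graph Laplacian. This is a generic normalized-cut-style toy, not the paper's exact procedure; the concatenation fusion, the threshold `tau`, and the mean-split of the eigenvector are all our assumptions.

```python
import numpy as np

def affinity_bipartition(img_feats, flow_feats, tau=0.2):
    """Bipartition pixels with a spectral cut over a fused-feature affinity graph.

    img_feats, flow_feats: (N, D) per-pixel features. Fusion here is a plain
    concatenation (an assumption; the paper's fusion may differ).
    Returns a boolean foreground mask over the N pixels.
    """
    feats = np.concatenate([img_feats, flow_feats], axis=1)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    A = feats @ feats.T                            # cosine affinities
    A = np.where(A > tau, 1.0, 1e-5)               # sparsify, keep graph connected
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(d)) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian
    # second-smallest eigenvector (Fiedler vector) gives the bipartition
    vals, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]
    return fiedler > fiedler.mean()
```

On two well-separated feature clusters, the returned mask splits the pixels into the two groups; connected components of such a cut would then serve as candidate pseudo-instance masks.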
📝 Abstract
We propose FlowCut, a simple and capable method for unsupervised video instance segmentation, built on a three-stage framework that constructs a high-quality video dataset with pseudo labels. To our knowledge, our work is the first attempt to curate a pseudo-labeled video dataset for unsupervised video instance segmentation. In the first stage, we generate pseudo-instance masks by exploiting the affinities of features from both images and optical flows. In the second stage, we construct short video segments containing high-quality, consistent pseudo-instance masks by temporally matching them across frames. In the third stage, we extract our training instance-segmentation set from the YouTubeVIS-2021 videos and train a video segmentation model on it. FlowCut achieves state-of-the-art performance on the YouTubeVIS-2019, YouTubeVIS-2021, DAVIS-2017, and DAVIS-2017 Motion benchmarks.
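The second stage, temporal matching of pseudo-instance masks, can be sketched as follows: warp each mask to the next frame with optical flow, then greedily associate warped masks with next-frame candidates by IoU. The forward-splatting warp, the greedy assignment, and the `thresh` value are illustrative assumptions, not the paper's exact matching rule.

```python
import numpy as np

def warp_mask(mask, flow):
    """Warp a binary mask to the next frame via forward optical flow.

    mask: (H, W) bool array; flow: (H, W, 2) per-pixel (dx, dy) displacements.
    A simple nearest-pixel forward splat (hypothetical helper).
    """
    H, W = mask.shape
    warped = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)
    new_xs = np.clip(np.round(xs + flow[ys, xs, 0]).astype(int), 0, W - 1)
    new_ys = np.clip(np.round(ys + flow[ys, xs, 1]).astype(int), 0, H - 1)
    warped[new_ys, new_xs] = True
    return warped

def iou(a, b):
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def match_masks(prev_masks, flow, next_masks, thresh=0.5):
    """Greedily link each warped previous-frame mask to its best next-frame mask."""
    matches = []
    for i, m in enumerate(prev_masks):
        warped = warp_mask(m, flow)
        scores = [iou(warped, n) for n in next_masks]
        if scores and max(scores) >= thresh:
            matches.append((i, int(np.argmax(scores))))
    return matches
```

Chaining such matches across consecutive frames yields the short, temporally consistent pseudo-labeled clips that the third stage distills into the final video segmentation model.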