S2D: Sparse-To-Dense Keymask Distillation for Unsupervised Video Instance Segmentation

📅 2025-12-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing unsupervised video instance segmentation methods rely heavily on synthetic videos (e.g., generated by translating and scaling ImageNet images) and thus fail to capture complex real-world motion dynamics, including viewpoint changes, multi-object interactions, and camera motion. This work presents the first end-to-end unsupervised framework trained exclusively on real-world videos. We introduce a keymask selection mechanism that automatically identifies high-quality, sparse keyframe masks; propose a sparse-to-dense distillation framework that integrates deep motion priors with an implicit mask propagation network to generate temporally consistent dense masks; and design a novel Temporal DropLoss that improves robustness to inter-frame perturbations and motion discontinuities. Our method achieves significant improvements over state-of-the-art approaches across multiple benchmarks, with substantial gains in both segmentation accuracy and temporal stability.
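The keymask selection step is not spelled out on this page; a minimal sketch of the general idea, under the assumption that mask quality is scored by motion consistency (warp a frame's mask to the next frame with optical flow and measure IoU against that frame's mask), might look like the following. The function names, the nearest-neighbour warp, and the threshold are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def warp_mask(mask, flow):
    """Warp a binary mask forward with a dense (dy, dx) flow field
    (nearest-neighbour lookup; an illustrative stand-in for a real warp)."""
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys - flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs - flow[..., 1]).astype(int), 0, w - 1)
    return mask[src_y, src_x]

def select_keymasks(masks, flows, thresh=0.7):
    """Keep frame indices whose mask, warped to the next frame,
    overlaps well (IoU above `thresh`) with the next frame's mask."""
    keep = []
    for t in range(len(masks) - 1):
        warped = warp_mask(masks[t], flows[t])
        inter = np.logical_and(warped, masks[t + 1]).sum()
        union = np.logical_or(warped, masks[t + 1]).sum()
        iou = inter / union if union else 0.0
        if iou >= thresh:
            keep.append(t)
    return keep

# Toy example: a 2x2 object moving one pixel right per frame.
masks = [np.zeros((8, 8), bool) for _ in range(3)]
for t in range(3):
    masks[t][3:5, 2 + t:4 + t] = True
# Matching flow: every pixel moves one column to the right.
flows = [np.zeros((8, 8, 2)) for _ in range(2)]
for f in flows:
    f[..., 1] = 1.0

print(select_keymasks(masks, flows))  # → [0, 1]
```

Both transitions are flow-consistent in the toy sequence, so both frames qualify as keymasks; a noisy single-frame mask would score a low warped IoU and be dropped.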

📝 Abstract
In recent years, the state-of-the-art in unsupervised video instance segmentation has heavily relied on synthetic video data, generated from object-centric image datasets such as ImageNet. However, video synthesis by artificially shifting and scaling image instance masks fails to accurately model realistic motion in videos, such as perspective changes, motion of parts of one or more instances, or camera motion. To tackle this issue, we propose an unsupervised video instance segmentation model trained exclusively on real video data. We start from unsupervised instance segmentation masks on individual video frames. However, these single-frame segmentations exhibit temporal noise and their quality varies throughout the video. Therefore, we establish temporal coherence by identifying high-quality keymasks in the video by leveraging deep motion priors. The sparse keymask pseudo-annotations are then used to train a segmentation model for implicit mask propagation, for which we propose a Sparse-To-Dense Distillation approach aided by a Temporal DropLoss. After training the final model on the resulting dense label set, our approach outperforms the current state-of-the-art across various benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Synthetic training videos built by shifting and scaling image masks fail to model realistic motion (perspective changes, part motion, camera motion)
Unsupervised single-frame segmentation masks are temporally noisy and vary in quality across the video
Sparse, high-quality keymask pseudo-annotations must be propagated into temporally consistent dense labels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised model trained exclusively on real video data
Keymask identification via deep motion priors
Sparse-To-Dense Distillation with a Temporal DropLoss for implicit mask propagation
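The exact form of the Temporal DropLoss is not given on this page. One minimal sketch of the general idea, assuming it amounts to randomly dropping per-frame loss terms so the model cannot over-fit to any single frame's possibly noisy pseudo-label (the function name and drop probability are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_drop_loss(per_frame_losses, drop_prob=0.3):
    """Average per-frame segmentation losses while randomly zeroing
    out a subset of frames, simulating inter-frame perturbations and
    missing supervision. The paper's actual mechanism may differ.

    per_frame_losses: 1-D array with one loss value per frame.
    """
    keep = rng.random(len(per_frame_losses)) >= drop_prob
    if not keep.any():  # never drop every frame
        keep[rng.integers(len(per_frame_losses))] = True
    return per_frame_losses[keep].mean()

losses = np.array([0.8, 0.5, 1.2, 0.4])
print(temporal_drop_loss(losses))
```

Because the surviving frames change every call, the propagation network sees varying temporal supervision, which is one plausible way to gain the robustness to motion discontinuities the summary describes.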