VVitCutLER: Towards Unsupervised Object Detection and Segmentation in Videos

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses temporal drift and pseudo-label flickering in unsupervised video pixel-level understanding, challenges exacerbated by motion blur, occlusion, and rapid dynamics. To mitigate these issues, the authors propose the VVitCutLER framework, which integrates a Vision Transformer-based pseudo-label generator with a novel VitCut module that enforces cross-frame regional consistency to suppress error accumulation. Additionally, a distillation decoder and a cross-frame feature aggregation mechanism are introduced to enhance both instance mask quality and temporal coherence. Evaluated on standard video benchmarks, the method substantially outperforms existing unsupervised approaches, achieving notable improvements in object detection, instance segmentation accuracy, and temporal stability, thereby demonstrating the critical role of temporally consistent supervision in video pixel-level understanding.

📝 Abstract

Unsupervised pixel-level video understanding remains challenging in real-world scenarios, where motion blur, occlusion, and fast object dynamics often cause temporal drift and flickering pseudo-labels. We propose VVitCutLER, an unsupervised framework for video object detection and instance segmentation that improves pseudo-label quality through temporal consistency. Our core contribution is VitCut, a temporally stable pseudo-label generator that reduces error accumulation during field degradation through cross-frame region consistency. VitCut also uses a distillation decoder to achieve effective instance mask prediction. Building on VitCut, VVitCutLER further integrates cross-frame feature aggregation to enhance video-level robustness. Extensive experiments on standard video benchmarks demonstrate that VVitCutLER significantly improves detection and segmentation performance while reducing temporal instability. These results highlight the importance of temporally consistent supervision for robust pixel-level video understanding.
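To make the idea of cross-frame region consistency concrete, the following is a minimal sketch of one simple way to flag flickering pseudo-labels: a per-frame mask is kept only if it overlaps sufficiently with the mask in an adjacent frame. This is an illustration under stated assumptions, not the VitCut module described in the abstract. The function names `mask_iou` and `keep_consistent_frames`, the IoU criterion, and the 0.5 threshold are all hypothetical choices made for this example.

```python
import numpy as np


def mask_iou(a, b):
    """Intersection-over-union of two binary masks with the same shape."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Two empty masks: treat as no overlap rather than dividing by zero.
        return 0.0
    return float(np.logical_and(a, b).sum() / union)


def keep_consistent_frames(masks, iou_threshold=0.5):
    """Flag which per-frame pseudo-masks agree with a temporal neighbour.

    Hypothetical helper (not from the paper): frame t is kept when its mask
    has IoU >= iou_threshold with the mask of frame t-1 or frame t+1.
    A sequence with a single frame has no neighbours, so it is not kept.
    """
    keep = []
    for t, mask in enumerate(masks):
        neighbours = []
        if t > 0:
            neighbours.append(masks[t - 1])
        if t < len(masks) - 1:
            neighbours.append(masks[t + 1])
        keep.append(any(mask_iou(mask, n) >= iou_threshold for n in neighbours))
    return keep


def make_square_mask(top, left, size, shape=(8, 8)):
    """Build a toy binary mask containing one filled square."""
    mask = np.zeros(shape, dtype=bool)
    mask[top:top + size, left:left + size] = True
    return mask


# Toy sequence: the object is stable except in frame 2, where the
# pseudo-label jumps to an unrelated region (a "flicker").
stable = make_square_mask(2, 2, 4)
flicker = make_square_mask(0, 0, 2)
sequence = [stable, stable, flicker, stable, stable]
flags = keep_consistent_frames(sequence)
```

In this toy sequence only the flickering frame is flagged for removal. A real pipeline would presumably need to handle object motion between frames (for example by warping masks before comparing them), which this sketch deliberately ignores.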

Problem

Research questions and friction points this paper is trying to address.

unsupervised object detection

video instance segmentation

temporal consistency

pseudo-label flickering

pixel-level video understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal consistency

unsupervised video object detection

pseudo-label stabilization