🤖 AI Summary
This work addresses the challenging task of open-vocabulary, instance-agnostic video object segmentation—where object categories are unconstrained, instance counts vary dynamically, and segmentation is guided by only a single frame or few-shot annotations. We propose a novel framework leveraging pretrained text-to-image diffusion models for the first time in video segmentation. Our method introduces latent-space semantic guidance, optical-flow-driven inter-frame propagation, adaptive mask disentanglement, and re-ranking to generate temporally coherent, pixel-accurate masks across frames. Compared to state-of-the-art methods, our approach achieves +4.2% mAP on DAVIS and +3.8% mAP on YouTube-VOS. It supports zero-shot generalization to unseen categories and enables interactive, real-time mask editing. By unifying generative priors with video-specific motion modeling, our framework significantly enhances flexibility and robustness for fine-grained video segmentation in open-world scenarios.
📝 Abstract
Segmenting an object in a video presents significant challenges. Each pixel must be accurately labelled, and these labels must remain consistent across frames. The difficulty increases when the segmentation is with arbitrary granularity, meaning the number of segments can vary arbitrarily, and masks are defined based on only one or a few sample images. In this paper, we address this issue by employing a pre-trained text to image diffusion model supplemented with an additional tracking mechanism. We demonstrate that our approach can effectively manage various segmentation scenarios and outperforms state-of-the-art alternatives.