SMITE: Segment Me In TimE

📅 2024-10-24

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

career value

190K/year

🤖 AI Summary

This work addresses the challenging task of open-vocabulary, instance-agnostic video object segmentation—where object categories are unconstrained, instance counts vary dynamically, and segmentation is guided by only a single frame or few-shot annotations. We propose a novel framework leveraging pretrained text-to-image diffusion models for the first time in video segmentation. Our method introduces latent-space semantic guidance, optical-flow-driven inter-frame propagation, adaptive mask disentanglement, and re-ranking to generate temporally coherent, pixel-accurate masks across frames. Compared to state-of-the-art methods, our approach achieves +4.2% mAP on DAVIS and +3.8% mAP on YouTube-VOS. It supports zero-shot generalization to unseen categories and enables interactive, real-time mask editing. By unifying generative priors with video-specific motion modeling, our framework significantly enhances flexibility and robustness for fine-grained video segmentation in open-world scenarios.

Technology Category

Application Category

📝 Abstract

Segmenting an object in a video presents significant challenges. Each pixel must be accurately labelled, and these labels must remain consistent across frames. The difficulty increases when the segmentation is with arbitrary granularity, meaning the number of segments can vary arbitrarily, and masks are defined based on only one or a few sample images. In this paper, we address this issue by employing a pre-trained text to image diffusion model supplemented with an additional tracking mechanism. We demonstrate that our approach can effectively manage various segmentation scenarios and outperforms state-of-the-art alternatives.

Problem

Research questions and friction points this paper is trying to address.

Accurate pixel labeling in video segmentation

Consistent labels across video frames

Handling arbitrary granularity in segmentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses pre-trained diffusion model

Incorporates tracking mechanism

Handles arbitrary granularity segmentation

🔎 Similar Papers

On Efficient Variants of Segment Anything Model: A Survey