UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model

📅 2023-05-22

🏛️ arXiv.org

📈 Citations: 18

✨ Influential: 0

🤖 AI Summary

Unsupervised video object segmentation (UVOS) methods rely heavily on frame-wise mask annotations and generalize poorly. To address this, we propose the first mask-free UVOS paradigm: we adapt the Segment Anything Model (SAM) to the video domain, obtaining temporally consistent segmentation using only learnable bounding boxes as prompts, without any mask supervision. Our key contributions are: (1) STD-Net, a tracker built on spatio-temporal decoupled deformable attention, which makes box prompts more robust and more consistent across frames in complex scenes; and (2) a prompt-driven video propagation framework coupled with unsupervised temporal feature alignment. Experiments demonstrate state-of-the-art performance on DAVIS2017-Unsupervised and YouTube-VIS 2019/2021, surpassing mainstream mask-supervised methods without any mask supervision and reaching a J&F score of 68.3%. The method also generalizes well to weakly annotated data.
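For readers unfamiliar with the J&F score quoted above: in the DAVIS benchmark convention, J is the region similarity (intersection over union between predicted and ground-truth masks) and F is a boundary F-measure, and J&F is their mean. The sketch below computes J for a pair of toy masks and averages it with a given F value. The function names are illustrative and not taken from the UVOSAM code; F itself is not computed here because it requires boundary matching.

```python
from typing import List

Mask = List[List[int]]  # H x W binary mask, 1 = object pixel


def jaccard(pred: Mask, gt: Mask) -> float:
    """Region similarity J: intersection over union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += int(bool(p) and bool(g))
            union += int(bool(p) or bool(g))
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j: float, f: float) -> float:
    """J&F is the arithmetic mean of region similarity and boundary accuracy."""
    return (j + f) / 2.0


pred = [[1, 1, 0],
        [0, 0, 0]]
gt = [[0, 1, 1],
      [0, 0, 0]]
j = jaccard(pred, gt)  # 1 shared pixel out of 3 covered pixels
```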

📝 Abstract

The current state-of-the-art methods for unsupervised video object segmentation (UVOS) require extensive training on video datasets with mask annotations, limiting their effectiveness in handling challenging scenarios. However, the Segment Anything Model (SAM) introduces a new prompt-driven paradigm for image segmentation, offering new possibilities. In this study, we investigate SAM's potential for UVOS through different prompt strategies. We then propose UVOSAM, a mask-free paradigm for UVOS that utilizes the STD-Net tracker. STD-Net incorporates a spatial-temporal decoupled deformable attention mechanism to establish an effective correlation between intra- and inter-frame features, remarkably enhancing the quality of box prompts in complex video scenes. Extensive experiments on the DAVIS2017-unsupervised and YoutubeVIS19&21 datasets demonstrate the superior performance of UVOSAM without mask supervision compared to existing mask-supervised methods, as well as its ability to generalize to weakly-annotated video datasets. Code can be found at https://github.com/alibaba/UVOSAM.
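The box-prompt idea described in the abstract can be sketched as a two-stage loop: a tracker supplies identity-consistent boxes for each frame, and a promptable segmenter turns each box into a mask. The code below is a minimal illustration under stated assumptions, not the authors' implementation. `TrackedBox`, `segment_video`, and `box_fill_segmenter` are hypothetical names, the tracker output is hard-coded, and the segmenter is a stand-in that simply fills the box. With the real `segment_anything` package, the equivalent step would pass the box to `SamPredictor.predict` through its `box` argument.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive
Mask = List[List[bool]]          # H x W binary mask
Frame = List[List[int]]          # placeholder grayscale frame


@dataclass
class TrackedBox:
    track_id: int  # identity the tracker keeps consistent across frames
    box: Box


def box_fill_segmenter(frame: Frame, box: Box) -> Mask:
    """Hypothetical stand-in for a promptable segmenter such as SAM.

    It marks every pixel inside the box; a real model would return a
    tight object mask for the same box prompt.
    """
    height, width = len(frame), len(frame[0])
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 for x in range(width)]
            for y in range(height)]


def segment_video(
    frames: Sequence[Frame],
    boxes_per_frame: Sequence[Sequence[TrackedBox]],
    segmenter: Callable[[Frame, Box], Mask] = box_fill_segmenter,
) -> List[Dict[int, Mask]]:
    """Convert per-frame tracked boxes into per-frame masks keyed by track id."""
    results: List[Dict[int, Mask]] = []
    for frame, tracked in zip(frames, boxes_per_frame):
        results.append({t.track_id: segmenter(frame, t.box) for t in tracked})
    return results


# Toy input: two 4x6 frames and hand-written "tracker" output.
frames = [[[0] * 6 for _ in range(4)] for _ in range(2)]
boxes = [
    [TrackedBox(1, (0, 0, 2, 2))],
    [TrackedBox(1, (1, 0, 3, 2)), TrackedBox(2, (4, 2, 6, 4))],
]
masks = segment_video(frames, boxes)
```

Because each mask inherits the track id of its box prompt, temporal consistency in this sketch depends entirely on the tracker, which is the role the abstract assigns to STD-Net.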

Problem

Research questions and friction points this paper is trying to address.

Eliminates need for mask annotations in video object segmentation

Leverages SAM for unsupervised video object segmentation

Enhances prompt quality in complex video scenes

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Segment Anything Model for mask-free UVOS

Incorporates the STD-Net tracker with spatial-temporal decoupled deformable attention

Enhances box prompts in complex video scenes
