🤖 AI Summary
This paper introduces video referring matting, a novel task that generates temporally coherent, semantically precise instance-level alpha mattes for video, given a natural language caption describing the target instance. Methodologically, the authors construct the first large-scale triplet dataset of videos, captions, and instance-level alpha mattes; propose a Latent-Constructive loss that sharpens multi-instance discrimination and enables controllable inter-instance interaction; and leverage the text-to-video alignment prior of video diffusion models for end-to-end differentiable training. Experiments on the accompanying dataset of 10,000 videos demonstrate substantial improvements in the temporal consistency and semantic alignment of the predicted mattes. The code and dataset are publicly released.
📝 Abstract
We propose a new task, video referring matting, which predicts the alpha matte of a specified instance given a referring caption. We cast the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models to produce alpha mattes that are temporally coherent and closely tied to the corresponding semantic instances. Moreover, we propose a new Latent-Constructive loss to better distinguish different instances, enabling more controllable interactive matting. Additionally, we introduce a large-scale video referring matting dataset with 10,000 videos. To the best of our knowledge, this is the first dataset that simultaneously contains captions, videos, and instance-level alpha mattes. Extensive experiments demonstrate the effectiveness of our method. The dataset and code are available at https://github.com/Hansxsourse/VRMDiff.
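To give a rough intuition for the multi-instance discrimination idea behind the Latent-Constructive loss, the sketch below implements a generic InfoNCE-style contrastive loss over per-instance latent embeddings: latents of the same instance are pulled together while latents of different instances are pushed apart. The function name, temperature, and exact formulation here are illustrative assumptions, not the paper's actual loss.

```python
import numpy as np

def latent_contrastive_loss(latents, labels, temperature=0.1):
    """InfoNCE-style contrastive loss over instance latents.

    Hypothetical sketch: same-instance latents act as positives,
    all other latents as negatives. The paper's Latent-Constructive
    loss may differ in its precise formulation.
    """
    # L2-normalize so similarities are cosine similarities
    z = latents / np.linalg.norm(latents, axis=1, keepdims=True)
    sim = z @ z.T / temperature           # pairwise scaled similarities
    n = len(labels)
    loss, count = 0.0, 0
    for i in range(n):
        # positives: other latents belonging to the same instance
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue
        # denominator sums over all latents except the anchor itself
        exp_row = np.exp(sim[i]) * (np.arange(n) != i)
        denom = exp_row.sum()
        for j in pos:
            loss += -np.log(np.exp(sim[i, j]) / denom)
            count += 1
    return loss / max(count, 1)
```

Under this formulation, well-separated instance clusters in latent space yield a lower loss than entangled ones, which is the property the paper exploits to keep different instances' mattes apart.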