🤖 AI Summary
This paper introduces video referring matting, a novel task that generates temporally coherent, semantically precise instance-level alpha mattes for video, given a natural language caption describing the target instance. Methodologically, the authors construct the first large-scale triplet dataset of videos, captions, and instance-level alpha mattes; propose a Latent-Constructive loss that sharpens multi-instance discrimination and enables controllable inter-instance interaction; and leverage the text-to-video alignment prior of video diffusion models for end-to-end differentiable training. Experiments on the accompanying dataset of 10,000 videos demonstrate substantial improvements in the temporal consistency and semantic alignment of the predicted mattes. The code and dataset are publicly released.
📝 Abstract
We propose a new task, video referring matting, which predicts the alpha matte of a specified instance given a referring caption. We cast the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models to produce alpha mattes that are temporally coherent and closely tied to the corresponding semantic instances. Moreover, we propose a new Latent-Constructive loss to better distinguish different instances, enabling more controllable interactive matting. Additionally, we introduce a large-scale video referring matting dataset with 10,000 videos. To the best of our knowledge, this is the first dataset that simultaneously contains captions, videos, and instance-level alpha mattes. Extensive experiments demonstrate the effectiveness of our method. The dataset and code are available at https://github.com/Hansxsourse/VRMDiff.
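To give a rough intuition for the multi-instance discrimination idea behind the Latent-Constructive loss, the sketch below implements a generic InfoNCE-style contrastive loss over per-instance latent embeddings: latents of the same instance are pulled together while latents of different instances are pushed apart. The function name, temperature, and exact formulation here are illustrative assumptions, not the paper's actual loss.

```python
import numpy as np

def latent_contrastive_loss(latents, labels, temperature=0.1):
    """InfoNCE-style contrastive loss over instance latents.

    Hypothetical sketch: same-instance latents act as positives,
    all other latents as negatives. The paper's Latent-Constructive
    loss may differ in its precise formulation.
    """
    # L2-normalize so similarities are cosine similarities
    z = latents / np.linalg.norm(latents, axis=1, keepdims=True)
    sim = z @ z.T / temperature           # pairwise scaled similarities
    n = len(labels)
    loss, count = 0.0, 0
    for i in range(n):
        # positives: other latents belonging to the same instance
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue
        # denominator sums over all latents except the anchor itself
        exp_row = np.exp(sim[i]) * (np.arange(n) != i)
        denom = exp_row.sum()
        for j in pos:
            loss += -np.log(np.exp(sim[i, j]) / denom)
            count += 1
    return loss / max(count, 1)
```

Under this formulation, well-separated instance clusters in latent space yield a lower loss than entangled ones, which is the property the paper exploits to keep different instances' mattes apart.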