🤖 AI Summary
Video matting has long suffered from the scarcity of high-quality alpha mattes for real-world videos, which severely limits model generalization. This paper proposes a generative video matting paradigm built on video diffusion models. First, the authors design a scalable synthetic data pipeline and use large-scale pseudo-label pretraining to bootstrap representation learning. Second, they leverage the spatiotemporal priors encoded in pretrained video diffusion models to directly generate temporally coherent, high-fidelity alpha sequences end to end, eliminating frame-wise independent inference and post-hoc temporal aggregation. This approach helps bridge the synthetic-to-real domain gap and improves fidelity on complex human subjects and fine-grained hair structures. Quantitatively, the method achieves state-of-the-art performance on three standard benchmarks; qualitatively, it generalizes well across diverse real-world scenarios, delivering superior visual matting results.
📝 Abstract
Video matting has traditionally been limited by the lack of high-quality ground-truth data. Most existing video matting datasets provide only imperfect human-annotated alpha and foreground annotations, which must be composited onto background images or videos during training. As a result, the generalization of previous methods to real-world scenarios is typically poor. In this work, we address the problem from two perspectives. First, we emphasize the importance of large-scale pre-training on diverse synthetic and pseudo-labeled segmentation datasets. We also develop a scalable synthetic data generation pipeline that renders diverse human bodies and fine-grained hair, yielding around 200 three-second video clips for fine-tuning. Second, we introduce a novel video matting approach that effectively leverages the rich priors of pre-trained video diffusion models. This architecture offers two key advantages. First, strong priors play a critical role in bridging the domain gap between synthetic and real-world scenes. Second, unlike most existing methods that process video matting frame by frame and rely on a separate decoder to aggregate temporal information, our model is inherently designed for video, ensuring strong temporal consistency. We provide a comprehensive quantitative evaluation across three benchmark datasets, demonstrating our approach's superior performance, and present extensive qualitative results in diverse real-world scenes, illustrating the strong generalization capability of our method. The code is available at https://github.com/aim-uofa/GVM.
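The compositing step mentioned above follows the standard alpha-compositing equation, I = αF + (1 − α)B, applied per pixel. A minimal sketch in NumPy (the function name and array shapes are illustrative, not from the paper):

```python
import numpy as np

def composite(fg: np.ndarray, alpha: np.ndarray, bg: np.ndarray) -> np.ndarray:
    """Composite a foreground onto a background with an alpha matte.

    fg, bg: float arrays of shape (H, W, 3) with values in [0, 1]
    alpha:  float array of shape (H, W, 1) with values in [0, 1]
    Returns the composited frame I = alpha * F + (1 - alpha) * B.
    """
    return alpha * fg + (1.0 - alpha) * bg

# Toy example: a half-transparent white foreground over a black background.
fg = np.ones((4, 4, 3))            # pure white foreground
bg = np.zeros((4, 4, 3))           # pure black background
alpha = np.full((4, 4, 1), 0.5)    # uniform 50% opacity matte
frame = composite(fg, alpha, bg)   # every pixel becomes 0.5 gray
```

Training data built this way pairs a composited frame with its known alpha, which is exactly why a domain gap arises: the composites never match real-world lighting and noise statistics.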