VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control

📅 2025-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video inpainting methods struggle to simultaneously preserve background context and generate plausible foreground objects, especially in fully occluded regions. To address this, we propose VideoPainter, a dual-stream architecture that injects background-aware contextual cues into a pretrained video DiT via a lightweight context encoder (only 6% of the backbone's parameters), enabling semantically consistent, plug-and-play generation. We introduce a dual-stream context-disentanglement paradigm and a target-region ID resampling mechanism to support arbitrary-length video inpainting. Furthermore, we construct VPData, the first large-scale, segmentation-annotated video inpainting dataset (390K+ clips), and establish the accompanying VPBench benchmark. VideoPainter achieves state-of-the-art performance across eight core metrics, including video quality, mask-region fidelity, and text alignment, significantly outperforming prior methods. It also enables end-to-end video editing and facilitates the synthesis of editing-pair training data.
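The dual-stream idea can be sketched in a few lines: a frozen backbone processes the noisy video tokens while a small, separately trained context encoder reads the masked-video features and adds them back into the backbone as residual cues at each injection point. The snippet below is a toy NumPy illustration under invented shapes and names (`WIDTH`, `forward`, plain linear maps standing in for DiT blocks); it is not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 frozen backbone blocks of width 64.
WIDTH, N_BLOCKS = 64, 4

# Frozen pretrained backbone: each block is a fixed linear map
# (a stand-in for a DiT transformer block).
backbone = [rng.standard_normal((WIDTH, WIDTH)) / np.sqrt(WIDTH)
            for _ in range(N_BLOCKS)]

# Lightweight context encoder: one small map per injection point,
# producing a residual cue from the masked-video features.
ctx_encoder = [rng.standard_normal((WIDTH, WIDTH)) * 0.01
               for _ in range(N_BLOCKS)]

def forward(tokens, masked_tokens, inject=True):
    """Run the frozen backbone, optionally adding context-encoder
    residuals computed from the masked video at every block."""
    h = tokens
    for blk, ctx in zip(backbone, ctx_encoder):
        h = h @ blk
        if inject:
            h = h + masked_tokens @ ctx  # plug-and-play additive cue
    return h

tokens = rng.standard_normal((16, WIDTH))   # noisy video tokens
masked = rng.standard_normal((16, WIDTH))   # masked-video features

out_plain = forward(tokens, masked, inject=False)
out_ctx = forward(tokens, masked, inject=True)
print(out_ctx.shape)
```

Because the cues enter as additive residuals, the backbone stays frozen and the same context encoder can, in principle, be attached to any compatible pretrained backbone, which is the "plug-and-play" property the summary refers to.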

📝 Abstract
Video inpainting, which aims to restore corrupted video content, has experienced substantial progress. Despite these advances, existing methods, whether propagating unmasked region pixels through optical flow and receptive field priors, or extending image-inpainting models temporally, face challenges in generating fully masked objects or balancing the competing objectives of background context preservation and foreground generation in one model, respectively. To address these limitations, we propose a novel dual-stream paradigm VideoPainter that incorporates an efficient context encoder (comprising only 6% of the backbone parameters) to process masked videos and inject backbone-aware background contextual cues to any pre-trained video DiT, producing semantically consistent content in a plug-and-play manner. This architectural separation significantly reduces the model's learning complexity while enabling nuanced integration of crucial background context. We also introduce a novel target region ID resampling technique that enables any-length video inpainting, greatly enhancing our practical applicability. Additionally, we establish a scalable dataset pipeline leveraging current vision understanding models, contributing VPData and VPBench to facilitate segmentation-based inpainting training and assessment, the largest video inpainting dataset and benchmark to date with over 390K diverse clips. Using inpainting as a pipeline basis, we also explore downstream applications including video editing and video editing pair data generation, demonstrating competitive performance and significant practical potential. Extensive experiments demonstrate VideoPainter's superior performance in both any-length video inpainting and editing, across eight key metrics, including video quality, mask region preservation, and textual coherence.
Problem

Research questions and friction points this paper is trying to address.

Generating plausible content for fully masked objects, where no unmasked pixels can be propagated.
Balancing background-context preservation against foreground generation within a single model.
Supporting any-length video inpainting and editing with plug-and-play context control.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-stream paradigm with an efficient context encoder (only 6% of backbone parameters).
Target-region ID resampling for any-length video inpainting.
Scalable dataset pipeline producing VPData and VPBench (390K+ clips).
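The target-region ID resampling idea can be illustrated as a clip-by-clip loop: after generating each clip, only the tokens inside the target (masked) region are carried forward as conditioning for the next clip, so the inpainted object keeps a consistent identity over arbitrary video length. A toy NumPy sketch, with `generate_clip`, `inpaint_long_video`, and all shapes invented for illustration rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

TOKENS_PER_CLIP, DIM = 12, 8

def generate_clip(cond_tokens):
    """Stand-in for one denoising pass: a hypothetical generator that
    returns this clip's tokens biased by the conditioning tokens."""
    return rng.standard_normal((TOKENS_PER_CLIP, DIM)) + cond_tokens.mean(axis=0)

def inpaint_long_video(n_clips, target_mask):
    """Process an arbitrarily long video clip by clip, resampling the
    target-region (masked) tokens of each clip into the next clip's
    conditioning so the inpainted object keeps its identity."""
    cond = np.zeros((1, DIM))        # empty conditioning for clip 0
    clips = []
    for _ in range(n_clips):
        clip = generate_clip(cond)
        clips.append(clip)
        # ID resampling: carry forward only tokens in the target region.
        cond = clip[target_mask]
    return np.concatenate(clips)

mask = np.zeros(TOKENS_PER_CLIP, dtype=bool)
mask[4:8] = True                     # hypothetical target region
video = inpaint_long_video(n_clips=3, target_mask=mask)
print(video.shape)
```

The design choice being sketched: conditioning on only the target-region tokens, rather than whole previous clips, keeps the per-clip context small and focused on the object's identity, which is what makes the any-length setting tractable.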