🤖 AI Summary
Addressing key challenges in high-resolution real-world video inpainting, namely temporal inconsistency and distortion arising from complex human-object-shadow interactions, this paper proposes a text-prompt-free, end-to-end video inpainting framework. Methodologically, it (1) introduces a reference-frame integration technique and a Dual-Fusion Latent Segment Refinement module to improve long-sequence temporal coherence and fine-detail fidelity; (2) strengthens motion modeling via text-to-video diffusion priors, coupled with progressive denoising in the VAE latent space; and (3) integrates a lightweight human-and-belongings segmentation module for precise mask generation. Evaluated on multiple challenging real-world video datasets, the method improves significantly over state-of-the-art approaches in visual quality and temporal consistency, and scales practically to long-duration, high-resolution video processing.
📝 Abstract
Inpainting for real-world human and pedestrian removal in high-resolution video clips presents significant challenges, particularly in achieving high-quality outcomes, ensuring temporal consistency, and managing complex object interactions that involve humans, their belongings, and their shadows. In this paper, we introduce VIP (Video Inpainting Pipeline), a novel promptless video inpainting framework for real-world human removal applications. VIP enhances a state-of-the-art text-to-video model with a motion module and employs a Variational Autoencoder (VAE) for progressive denoising in the latent space. Additionally, we implement an efficient human-and-belongings segmentation module for precise mask generation. Extensive experimental results demonstrate that VIP achieves superior temporal consistency and visual fidelity across diverse real-world scenarios, surpassing state-of-the-art methods on challenging datasets. Our key contributions include the development of the VIP pipeline, a reference frame integration technique, and the Dual-Fusion Latent Segment Refinement method, all of which address the complexities of inpainting in long, high-resolution video sequences.
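The abstract's "progressive denoising in the latent space" can be illustrated with a standard mask-guided latent diffusion loop, in which the masked region is synthesized by iterative denoising while the known (unmasked) region is re-injected at a matching noise level on every step. This is a minimal NumPy sketch of that generic technique, not the paper's actual implementation; the `denoise_step` callable and the noise schedule are placeholders standing in for the text-to-video diffusion model and its scheduler.

```python
import numpy as np

def progressive_latent_inpaint(latent, mask, denoise_step, num_steps=10, seed=0):
    """Illustrative mask-guided progressive denoising in a VAE latent space.

    latent:       (C, H, W) clean VAE latents of the input frame (assumed given)
    mask:         (1, H, W) binary mask, 1 = region to inpaint
    denoise_step: callable(noisy_latent, t) -> less noisy latent (placeholder
                  for the diffusion model's denoiser)
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(latent.shape)  # start from pure noise
    for t in range(num_steps, 0, -1):
        x = denoise_step(x, t)
        # Re-noise the known content to the current step's noise level and
        # paste it back, so only the masked region is actually synthesized.
        noise_scale = (t - 1) / num_steps  # toy linear schedule (assumption)
        known = latent + noise_scale * rng.standard_normal(latent.shape)
        x = mask * x + (1 - mask) * known
    return x
```

At the final step the noise scale reaches zero, so the unmasked region exactly matches the input latents; only the masked region carries generated content, which a VAE decoder would then map back to pixels.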