🤖 AI Summary
Addressing key challenges in high-resolution real-world video inpainting, namely temporal inconsistency and distortion arising from complex human-object-shadow interactions, this paper proposes a text-prompt-free, end-to-end video inpainting framework. Methodologically, it (1) introduces a reference-frame integration technique and a Dual-Fusion Latent Segment Refinement module to improve long-sequence temporal coherence and fine-detail fidelity; (2) strengthens motion modeling via text-to-video diffusion priors, coupled with progressive denoising in the VAE latent space; and (3) integrates a lightweight human-and-belongings segmentation module for precise mask generation. Evaluated on multiple challenging real-world video datasets, the method improves significantly over state-of-the-art approaches in visual quality and temporal consistency, and scales practically to long-duration, high-resolution video processing.
📝 Abstract
Inpainting for real-world human and pedestrian removal in high-resolution video clips presents significant challenges, particularly in achieving high-quality outcomes, ensuring temporal consistency, and managing complex object interactions that involve humans, their belongings, and their shadows. In this paper, we introduce VIP (Video Inpainting Pipeline), a novel promptless video inpainting framework for real-world human removal applications. VIP enhances a state-of-the-art text-to-video model with a motion module and employs a Variational Autoencoder (VAE) for progressive denoising in the latent space. Additionally, we implement an efficient human-and-belongings segmentation module for precise mask generation. Extensive experimental results demonstrate that VIP achieves superior temporal consistency and visual fidelity across diverse real-world scenarios, surpassing state-of-the-art methods on challenging datasets. Our key contributions include the development of the VIP pipeline, a reference frame integration technique, and the Dual-Fusion Latent Segment Refinement method, all of which address the complexities of inpainting in long, high-resolution video sequences.
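The abstract's "progressive denoising in the latent space" can be illustrated with a standard mask-guided latent diffusion loop, in which the masked region is synthesized by iterative denoising while the known (unmasked) region is re-injected at a matching noise level on every step. This is a minimal NumPy sketch of that generic technique, not the paper's actual implementation; the `denoise_step` callable and the noise schedule are placeholders standing in for the text-to-video diffusion model and its scheduler.

```python
import numpy as np

def progressive_latent_inpaint(latent, mask, denoise_step, num_steps=10, seed=0):
    """Illustrative mask-guided progressive denoising in a VAE latent space.

    latent:       (C, H, W) clean VAE latents of the input frame (assumed given)
    mask:         (1, H, W) binary mask, 1 = region to inpaint
    denoise_step: callable(noisy_latent, t) -> less noisy latent (placeholder
                  for the diffusion model's denoiser)
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(latent.shape)  # start from pure noise
    for t in range(num_steps, 0, -1):
        x = denoise_step(x, t)
        # Re-noise the known content to the current step's noise level and
        # paste it back, so only the masked region is actually synthesized.
        noise_scale = (t - 1) / num_steps  # toy linear schedule (assumption)
        known = latent + noise_scale * rng.standard_normal(latent.shape)
        x = mask * x + (1 - mask) * known
    return x
```

At the final step the noise scale reaches zero, so the unmasked region exactly matches the input latents; only the masked region carries generated content, which a VAE decoder would then map back to pixels.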