MTV-Inpaint: Multi-Task Long Video Inpainting

📅 2025-03-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video inpainting methods focus mainly on scene completion and cannot insert new objects in a controllable way, while text-to-video (T2V) diffusion models adapted directly for inpainting lack input controllability and struggle with long videos. This paper proposes MTV-Inpaint, a unified multi-task framework for inpainting long videos (hundreds of frames). It introduces a dual-branch spatial attention mechanism in the T2V diffusion U-Net to handle scene completion and object insertion within a single model; adds an image-to-video (I2V) inpainting mode that integrates existing image inpainting models to provide multimodal control beyond text; and uses a two-stage pipeline of keyframe inpainting followed by in-between frame propagation to handle long videos. The authors report state-of-the-art performance on both scene completion and object insertion, with derived applications including multimodal inpainting, object editing, object removal, and image object brush.

📝 Abstract

Video inpainting involves modifying local regions within a video, ensuring spatial and temporal consistency. Most existing methods focus primarily on scene completion (i.e., filling missing regions) and lack the capability to insert new objects into a scene in a controllable manner. Fortunately, recent advancements in text-to-video (T2V) diffusion models pave the way for text-guided video inpainting. However, directly adapting T2V models for inpainting remains limited in unifying completion and insertion tasks, lacks input controllability, and struggles with long videos, thereby restricting their applicability and flexibility. To address these challenges, we propose MTV-Inpaint, a unified multi-task video inpainting framework capable of handling both traditional scene completion and novel object insertion tasks. To unify these distinct tasks, we design a dual-branch spatial attention mechanism in the T2V diffusion U-Net, enabling seamless integration of scene completion and object insertion within a single framework. In addition to textual guidance, MTV-Inpaint supports multimodal control by integrating various image inpainting models through our proposed image-to-video (I2V) inpainting mode. Additionally, we propose a two-stage pipeline that combines keyframe inpainting with in-between frame propagation, enabling MTV-Inpaint to effectively handle long videos with hundreds of frames. Extensive experiments demonstrate that MTV-Inpaint achieves state-of-the-art performance in both scene completion and object insertion tasks. Furthermore, it demonstrates versatility in derived applications such as multi-modal inpainting, object editing, removal, image object brush, and the ability to handle long videos. Project page: https://mtv-inpaint.github.io/.

Problem

Research questions and friction points this paper is trying to address.

Existing methods do not unify scene completion and object insertion in a single video inpainting framework.

T2V models adapted directly for inpainting offer limited input controllability and flexibility.

Long videos with hundreds of frames remain difficult to handle.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multi-task video inpainting framework

Dual-branch spatial attention mechanism

Two-stage pipeline for long videos
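
The two-stage idea (inpaint keyframes first, then propagate to the frames between them) can be illustrated with a small scheduling sketch. This is only an assumption-laden illustration, not the authors' implementation: the abstract does not say how keyframes are chosen, so the uniform `stride` sampling and the helper names `select_keyframes`, `in_between_segments`, and `plan_two_stage` are all hypothetical. The sketch plans which frames each stage would process; it does not run any diffusion model.

```python
from typing import Dict, List, Tuple


def select_keyframes(num_frames: int, stride: int) -> List[int]:
    """Pick every `stride`-th frame as a keyframe and always include the last frame.

    Uniform sampling is an assumption made for this sketch only.
    """
    if num_frames <= 0 or stride <= 0:
        raise ValueError("num_frames and stride must be positive")
    keyframes = list(range(0, num_frames, stride))
    if keyframes[-1] != num_frames - 1:
        keyframes.append(num_frames - 1)
    return keyframes


def in_between_segments(keyframes: List[int]) -> List[Tuple[int, int, List[int]]]:
    """For each pair of adjacent keyframes, list the frames strictly between them."""
    segments = []
    for start, end in zip(keyframes, keyframes[1:]):
        segments.append((start, end, list(range(start + 1, end))))
    return segments


def plan_two_stage(num_frames: int, stride: int) -> Dict[str, list]:
    """Stage 1 covers the keyframes; stage 2 covers each in-between segment,
    bounded on both sides by keyframes that stage 1 has already inpainted."""
    keyframes = select_keyframes(num_frames, stride)
    return {
        "stage1_keyframes": keyframes,
        "stage2_segments": in_between_segments(keyframes),
    }


if __name__ == "__main__":
    plan = plan_two_stage(num_frames=10, stride=4)
    print(plan["stage1_keyframes"])
    for start, end, middle in plan["stage2_segments"]:
        print(f"propagate between keyframes {start} and {end}: frames {middle}")
```

Under this plan every frame is handled exactly once, either as a keyframe in stage 1 or as an in-between frame in stage 2, so the number of frames the model sees at a time stays bounded regardless of video length.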
