Object-WIPER: Training-Free Object and Associated Effect Removal in Videos

📅 2026-01-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of semantically consistent and temporally coherent removal of dynamic objects in videos along with their associated visual effects, such as shadows and reflections. The authors propose a training-free editing framework built upon a pre-trained text-to-video diffusion Transformer (DiT), which leverages user-provided object masks and textual descriptions to locate and replace foreground visual tokens via cross-attention and self-attention mechanisms, thereby jointly eliminating both the target objects and their visual artifacts. Key contributions include the first demonstration of fine-tuning-free joint removal, integration of user-specified masks with attention-derived effect masks, and the introduction of novel evaluation metrics that assess temporal consistency, intra-frame coherence, and content fidelity. The method outperforms existing approaches on both DAVIS and a newly introduced benchmark, WIPER-Bench, and the code will be publicly released.
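The proposed evaluation rewards three things at the token level: temporal consistency of foreground tokens across consecutive frames, intra-frame coherence between foreground and background tokens, and dissimilarity between input and output foreground tokens. The paper's exact formulation is not reproduced here; the sketch below is an illustrative composite in that spirit, where the equal weighting and the use of cosine similarity are assumptions.

```python
import numpy as np

def cosine(a, b):
    # Mean cosine similarity between corresponding rows of a and b.
    num = (a * b).sum(-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    return float((num / den).mean())

def removal_score(in_tokens, out_tokens, fg_mask):
    """Illustrative composite removal metric (weights and similarity
    choice are assumptions, not the paper's exact definition). Rewards:
      1) temporal consistency of foreground tokens across frames,
      2) intra-frame coherence between foreground and background tokens,
      3) dissimilarity between input and output foreground tokens.

    in_tokens, out_tokens: (frames, tokens, dim); fg_mask: (tokens,) bool.
    """
    fg, bg = out_tokens[:, fg_mask], out_tokens[:, ~fg_mask]
    temporal = np.mean([cosine(fg[t], fg[t + 1]) for t in range(len(fg) - 1)])
    coherence = np.mean([cosine(f.mean(0, keepdims=True), b.mean(0, keepdims=True))
                         for f, b in zip(fg, bg)])
    removal = 1.0 - np.mean([cosine(in_tokens[t][fg_mask], fg[t])
                             for t in range(len(fg))])
    return (temporal + coherence + removal) / 3.0
```

A perfect edit under this toy score would produce foreground tokens that are stable over time, blend with the background, and share nothing with the original object tokens.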

📝 Abstract
In this paper, we introduce Object-WIPER, a training-free framework for removing dynamic objects and their associated visual effects from videos, and inpainting them with semantically consistent and temporally coherent content. Our approach leverages a pre-trained text-to-video diffusion transformer (DiT). Given an input video, a user-provided object mask, and query tokens describing the target object and its effects, we localize relevant visual tokens via visual-text cross-attention and visual self-attention. This produces an intermediate effect mask that we fuse with the user mask to obtain a final foreground token mask to replace. We first invert the video through the DiT to obtain structured noise, then reinitialize the masked tokens with Gaussian noise while preserving background tokens. During denoising, we copy back the background-token values saved during inversion to maintain scene fidelity. To address the lack of suitable evaluation, we introduce a new object removal metric that rewards temporal consistency among foreground tokens across consecutive frames, coherence between foreground and background tokens within each frame, and dissimilarity between the input and output foreground tokens. Experiments on DAVIS and a newly curated real-world associated effect benchmark (WIPER-Bench) show that Object-WIPER surpasses both training-based and training-free baselines on this metric, achieving clean removal and temporally stable reconstruction without any retraining. Our new benchmark, source code, and pre-trained models will be publicly available.
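The inversion-then-masked-denoising loop described in the abstract can be sketched as follows. This is a minimal simulation, not the authors' implementation: `denoise_step` stands in for the DiT denoiser, the latents are flat token arrays, and all names are hypothetical.

```python
import numpy as np

def remove_masked_tokens(inverted_latents, fg_mask, denoise_step, num_steps, rng):
    """Sketch of the editing loop from the abstract (assumed interfaces):
    foreground tokens are re-initialized with Gaussian noise, and
    background tokens are copied back from the latents saved during
    inversion at every denoising step to preserve the scene.

    inverted_latents: list of per-step latents saved during inversion,
                      each of shape (num_tokens, dim), index 0 = noisiest.
    fg_mask: boolean array (num_tokens,), True for tokens to replace.
    denoise_step: callable (latents, step) -> latents (DiT stand-in).
    """
    latents = inverted_latents[0].copy()
    # Re-initialize the foreground tokens with fresh Gaussian noise.
    latents[fg_mask] = rng.standard_normal((int(fg_mask.sum()), latents.shape[1]))
    for t in range(num_steps):
        latents = denoise_step(latents, t)
        # Overwrite background tokens with the values saved during
        # inversion so the unedited regions stay faithful to the input.
        latents[~fg_mask] = inverted_latents[t + 1][~fg_mask]
    return latents
```

Because the background tokens are restored from the inversion trajectory at every step, only the masked foreground region is actually re-synthesized by the denoiser.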
Problem

Research questions and friction points this paper is trying to address.

object removal
associated effects
video inpainting
training-free
temporal coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
video object removal
diffusion transformer
temporal coherence
visual-text attention