🤖 AI Summary
Video object removal is highly challenging: it requires eliminating not only the target object but also its associated artifacts—such as shadows and reflections—while preserving spatio-temporal consistency. This work proposes a novel approach that, for the first time, incorporates object-effect relationships learned by vision foundation models into video diffusion models via external knowledge distillation. To strengthen the model’s joint physical and semantic understanding of the target, its side effects, and the background, we introduce a frame-level contextual cross-attention mechanism. Evaluated on the first real-world benchmark for video object removal, our method significantly outperforms existing techniques, achieving more complete, stable, and visually coherent erasure results.
📝 Abstract
Video object removal aims to eliminate target objects from videos while plausibly completing missing regions and preserving spatio-temporal consistency. Although diffusion models have recently advanced this task, it remains challenging to remove object-induced side effects (e.g., shadows, reflections, and illumination changes) without compromising overall coherence. This limitation stems from insufficient physical and semantic understanding of the target object and its interactions with the scene. In this paper, we propose to inject such understanding into the erasing process from two complementary perspectives. Externally, we introduce a distillation scheme that transfers the relationships between objects and their induced effects from vision foundation models to video diffusion models. Internally, we propose a framewise context cross-attention mechanism that grounds each denoising block in informative, unmasked context surrounding the target region. External and internal guidance jointly enable our model to understand the target object, its induced effects, and the global background context, resulting in clean and coherent object removal. Extensive experiments demonstrate our state-of-the-art performance, and we establish the first real-world benchmark for video object removal to facilitate future research and community progress. Our code, data, and models are available at: https://github.com/WeChatCV/UnderEraser.
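The core idea of the framewise context cross-attention can be illustrated with a minimal sketch: per frame, every token queries only the unmasked (context) tokens, so features inside the masked target region are reconstructed from the surrounding background. All names, shapes, and the NumPy formulation below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def framewise_context_cross_attention(frame_feats, mask):
    """Per-frame cross-attention restricted to unmasked context.

    frame_feats: (T, N, D) token features for T frames, N tokens each.
    mask:        (T, N) boolean, True = token lies in the target (masked) region.
    Returns:     (T, N, D) features where every token attends only to the
                 unmasked context tokens of its own frame.
    Assumes each frame has at least one unmasked token.
    """
    T, N, D = frame_feats.shape
    out = np.empty_like(frame_feats)
    scale = np.sqrt(D)
    for t in range(T):
        ctx = frame_feats[t][~mask[t]]                  # (M, D) context tokens
        attn = softmax(frame_feats[t] @ ctx.T / scale)  # (N, M) attention weights
        out[t] = attn @ ctx                             # convex mix of context
    return out
```

Because the attention weights are non-negative and sum to one, each output token is a convex combination of that frame's context tokens, which is one simple way to ground masked-region features in the visible background.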