VOID: Video Object and Interaction Deletion

📅 2026-04-02
🤖 AI Summary
Existing video object removal methods struggle to preserve scene consistency when the removed object is involved in complex physical interactions, such as collisions, and often produce implausible or distorted outputs. This work addresses the limitation by bringing high-level causal reasoning and explicit physical consistency into the task. The authors construct a paired synthetic dataset of counterfactual object removals using Kubric and HUMOTO, and at inference use a vision-language model to identify the regions affected by the removal. These regions then guide a video diffusion model toward physically consistent edits. Experiments on both synthetic and real data show that the approach preserves scene dynamics after object removal better than prior video object removal methods.
📝 Abstract
Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.
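To make the idea of a counterfactual removal pair concrete, here is a toy sketch in Python. It is not the authors' pipeline and does not use Kubric or HUMOTO; the 1D world, the `simulate` function, and all of its parameters are hypothetical and chosen only for illustration. Object A slides into a resting object B (equal masses, elastic collision, so the velocities swap). Removing A means B must stay at rest, so a correct removal has to change B's motion and not just fill in the pixels behind A. The frames where the two rollouts differ play the role of the "affected region".

```python
def simulate(include_a: bool, steps: int = 10, dt: float = 1.0) -> list[float]:
    """Return B's position per frame in a hypothetical 1D toy world.

    A starts at x=0 moving right at 1 unit per frame; B rests at x=5.
    Equal-mass elastic collision: the two velocities swap on contact.
    """
    xa, va = 0.0, 1.0
    xb, vb = 5.0, 0.0
    track_b = []
    for _ in range(steps):
        track_b.append(xb)
        if include_a:
            # Contact check happens before the positions are advanced.
            if xa >= xb and va > vb:
                va, vb = vb, va
            xa += va * dt
        xb += vb * dt
    return track_b


# Factual rollout (A present) and counterfactual rollout (A removed).
factual = simulate(include_a=True)
counterfactual = simulate(include_a=False)

# Per-frame flag: does removing A change what B does in this frame?
affected = [f != c for f, c in zip(factual, counterfactual)]

print("factual B track:       ", factual)
print("counterfactual B track:", counterfactual)
print("first affected frame:  ", affected.index(True))
```

In this toy, B's track is identical in both rollouts until the collision and diverges afterwards. An appearance-only inpainter would reproduce the factual track with A erased, which is the failure mode the abstract describes.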
Problem

Research questions and friction points this paper is trying to address.

video object removal
physical interaction
counterfactual inpainting
scene dynamics
object interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

physically-plausible inpainting
video object removal
counterfactual generation
vision-language model
video diffusion model