GenEraser: Generalizable Video Object Removal via Balanced Text-Mask Guidance and Decoupled Locator-Preserver

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses video object removal in out-of-domain scenarios, where existing methods struggle to eliminate target objects together with their associated physical effects (such as reflections, shadows, and smoke) because of spatiotemporal ambiguities. The authors propose GenEraser, a framework that combines explicit text guidance with mask guidance: a Multi-Conditional Mixture-of-Experts (MC-MoE), paired with Bipartite Text guidance, exploits the multimodal priors of a Diffusion Transformer to identify complex physical effects. A Learnable Deep CFG Fusion mechanism (LD-CFG) adaptively balances the mask and text conditions, and a decoupled Locator-Preserver architecture mitigates the optimization conflict between semantic generalization and pixel-level fidelity. GenEraser reports state-of-the-art results, with PSNR gains of 2.16 dB on the ROSE Benchmark and 1.44 dB on VOR-Eval, and robust generalization in open-world settings.
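
To put the reported PSNR gains in perspective, the short sketch below uses the standard PSNR definition, 10 * log10(MAX^2 / MSE), to convert a dB gain into the corresponding reduction in mean squared error. The function names are illustrative and do not come from the paper. The conversion is exact only for a single image pair; benchmark PSNR is usually averaged over frames and videos, so the result is an order-of-magnitude reading rather than a reported figure.

```python
import math


def psnr(mse: float, max_val: float = 1.0) -> float:
    """Standard PSNR in dB for a given mean squared error and peak value."""
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks when PSNR increases by gain_db."""
    return 10.0 ** (gain_db / 10.0)


if __name__ == "__main__":
    # Gains quoted in the summary (ROSE Benchmark and VOR-Eval).
    for name, gain in [("ROSE", 2.16), ("VOR-Eval", 1.44)]:
        ratio = mse_ratio_for_gain(gain)
        print(f"{name}: +{gain} dB ~ MSE divided by {ratio:.2f}")
```

Under this reading, a 2.16 dB gain corresponds to an MSE roughly 39% lower, and a 1.44 dB gain to an MSE roughly 28% lower.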

📝 Abstract

Video object removal frequently struggles to simultaneously eliminate target objects and their associated physical effects (e.g., smoke, reflections, light, and ripples) in out-of-domain scenarios due to complex spatiotemporal ambiguities. While existing methods primarily rely on spatial masks, they often fail to capture weakly correlated effects, and the potential of explicit textual guidance remains underexplored. Furthermore, a fundamental optimization conflict exists in removal models between high-level semantic generalization and precise pixel-level background preservation. To address these challenges, we propose GenEraser, a novel framework for generalized and high-fidelity video object and effect removal. First, we introduce a Multi-Conditional Mixture-of-Experts (MC-MoE) paired with Bipartite Text guidance to fully exploit the multimodal priors of Diffusion Transformers, significantly enhancing the identification of complex effects. Second, a Learnable Deep "CFG" Fusion mechanism (LD-CFG) is developed to adaptively balance the relative dominance of mask and textual conditions across diverse scenarios. Finally, we propose a Decoupled Expert Architecture, comprising a Locator and a Preserver, to mitigate the inherent trade-off between semantic generalization and pixel alignment. Extensive experiments demonstrate that GenEraser surpasses recent state-of-the-art approaches, achieving significant quantitative improvements (e.g., 2.16 dB and 1.44 dB on the ROSE Benchmark and VOR-Eval, respectively) while maintaining robust generalization in open-world scenarios.

Project page: https://cyqii.github.io/GenEraser.github.io/
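
The abstract does not spell out how LD-CFG is formulated. As a rough illustration of what balancing mask and text conditions can mean, the sketch below shows a generic nested two-condition classifier-free guidance rule with learnable scalar weights. It is an assumption for illustration, not the paper's mechanism: the function `fuse_guidance`, the softplus parameterization, and the output-level (rather than deep, feature-level) fusion are all hypothetical.

```python
import math
from typing import List


def softplus(x: float) -> float:
    """Map an unconstrained parameter to a positive guidance weight."""
    return math.log1p(math.exp(x))


def fuse_guidance(
    eps_uncond: List[float],     # prediction with no condition
    eps_mask: List[float],       # prediction with the mask only
    eps_mask_text: List[float],  # prediction with mask and text
    raw_w_mask: float,           # learnable parameter (hypothetical)
    raw_w_text: float,           # learnable parameter (hypothetical)
) -> List[float]:
    """Generic nested CFG: first steer towards the mask condition, then
    add the extra direction contributed by the text condition."""
    w_mask = softplus(raw_w_mask)
    w_text = softplus(raw_w_text)
    return [
        u + w_mask * (m - u) + w_text * (mt - m)
        for u, m, mt in zip(eps_uncond, eps_mask, eps_mask_text)
    ]


if __name__ == "__main__":
    # Toy three-element "noise predictions" in place of real model outputs.
    u, m, mt = [0.0, 0.0, 0.0], [1.0, 0.5, 0.0], [1.0, 1.0, 1.0]
    print(fuse_guidance(u, m, mt, raw_w_mask=1.0, raw_w_text=2.0))
```

When both weights equal 1, this rule reduces to the fully conditioned prediction; raising one weight relative to the other shifts the dominance between mask and text, which is the kind of trade-off the abstract describes LD-CFG as learning adaptively.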

Problem

Research questions and friction points this paper is trying to address.

video object removal

physical effects

spatiotemporal ambiguities

semantic generalization

background preservation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Conditional Mixture-of-Experts

Text-Mask Guidance

Decoupled Locator-Preserver