🤖 AI Summary
This work addresses the challenges of temporal instability and visual inconsistency in video object removal under realistic conditions such as shadows, abrupt motion, and imperfect masks. To this end, the authors propose the SVOR framework, which integrates a Mask Union for Stable Erasure (MUSE) strategy that fuses masks within temporal windows to handle abrupt motion and mask deficiencies, and introduces a Denoising-Aware Segmentation (DA-Seg) module that leverages denoising-aware localization priors. The framework employs a curriculum-based two-stage training scheme—self-supervised pretraining followed by fine-tuning on synthetic data—to enhance robustness. Notably, SVOR is the first method to jointly achieve robustness against shadows, flickering artifacts, and mask defects, attaining state-of-the-art performance across multiple benchmarks and significantly improving the stability and practicality of video object removal in real-world scenarios.
📝 Abstract
Removing objects from videos remains difficult in the presence of real-world imperfections such as shadows, abrupt motion, and defective masks. Existing diffusion-based video inpainting models often struggle to maintain temporal stability and visual consistency under these challenges. We propose Stable Video Object Removal (SVOR), a robust framework that achieves shadow-free, flicker-free, and mask-defect-tolerant removal through three key designs: (1) Mask Union for Stable Erasure (MUSE), a windowed union strategy applied during temporal mask downsampling to preserve all target regions observed within each window, effectively handling abrupt motion and reducing missed removals; (2) Denoising-Aware Segmentation (DA-Seg), a lightweight segmentation head on a decoupled side branch equipped with Denoising-Aware AdaLN and trained with mask degradation to provide an internal diffusion-aware localization prior without affecting content generation; and (3) Curriculum Two-Stage Training, in which Stage I performs self-supervised pretraining on unpaired real-background videos with online random masks to learn realistic background and temporal priors, and Stage II refines on synthetic pairs using mask degradation and side-effect-weighted losses, jointly removing objects and their associated shadows/reflections while improving cross-domain robustness. Extensive experiments show that SVOR attains new state-of-the-art results across multiple datasets and degraded-mask benchmarks, advancing video object removal from ideal settings toward real-world applications.
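The core idea behind MUSE — taking the union of masks within each temporal window during downsampling, so no observed target region is dropped — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the use of non-overlapping windows, and zero-padding of the final window are all assumptions made for the example.

```python
import numpy as np

def muse_downsample(masks: np.ndarray, window: int) -> np.ndarray:
    """Temporally downsample a binary mask sequence (T, H, W) by taking
    the union (logical OR) over each non-overlapping window of `window`
    frames, so every target region observed anywhere in a window survives.
    Illustrative sketch of the MUSE idea, not the authors' exact code."""
    t, h, w = masks.shape
    # Pad with empty masks so the length is a multiple of the window size.
    pad = (-t) % window
    if pad:
        masks = np.concatenate(
            [masks, np.zeros((pad, h, w), dtype=masks.dtype)], axis=0
        )
    # Union within each window: a pixel marked in any frame stays marked,
    # which is what keeps fast-moving objects from slipping past a
    # plain frame-subsampling scheme.
    return masks.reshape(-1, window, h, w).any(axis=1).astype(masks.dtype)
```

For example, an object that appears at one location only in frame 0 and jumps to a new location in frame 3 is retained in both downsampled masks, whereas naive frame subsampling (keeping every second frame) would miss one of the two positions.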