MiniMax-Remover: Taming Bad Noise Helps Video Object Removal

📅 2025-05-30
🤖 AI Summary
Existing video object removal methods suffer from hallucinated objects, visual artifacts, and slow inference, the last largely because they rely on costly sampling procedures and classifier-free guidance (CFG). This paper proposes MiniMax-Remover, a lightweight two-stage framework. Stage I removes the text input and cross-attention layers from a pretrained video generation model, yielding a smaller and more efficient base remover. Stage II distills this remover on human-curated successful outputs of the Stage I model using a minimax objective: an inner maximization searches for adversarial input noise ("bad noise") that causes removal failures, and an outer minimization trains the model to produce high-quality removals even under that noise. The resulting model needs neither CFG nor text conditioning and runs with as few as 6 sampling steps, which significantly speeds up inference. The authors report state-of-the-art video object removal results in extensive experiments.

📝 Abstract
Recent advances in video diffusion models have driven rapid progress in video editing techniques. However, video object removal, a critical subtask of video editing, remains challenging due to issues such as hallucinated objects and visual artifacts. Furthermore, existing methods often rely on computationally expensive sampling procedures and classifier-free guidance (CFG), resulting in slow inference. To address these limitations, we propose MiniMax-Remover, a novel two-stage video object removal approach. Motivated by the observation that text conditioning is not well suited to this task, in the first stage we simplify the pretrained video generation model by removing the textual input and cross-attention layers, resulting in a more lightweight and efficient model architecture. In the second stage, we distill our remover on successful videos produced by the stage-1 model and curated by human annotators, using a minimax optimization strategy to further improve editing quality and inference speed. Specifically, the inner maximization identifies adversarial input noise ("bad noise") that causes removal failures, while the outer minimization trains the model to generate high-quality removal results even under such challenging conditions. As a result, our method achieves state-of-the-art video object removal results with as few as 6 sampling steps and does not rely on CFG, significantly improving inference efficiency. Extensive experiments demonstrate the effectiveness and superiority of MiniMax-Remover compared to existing methods. Code and videos are available at: https://minimax-remover.github.io.
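The inner/outer structure described in the abstract can be written, in generic form, as a robust-optimization objective. This is only a sketch: the symbols below (the remover f_θ, masked input video x, mask m, curated target y, input noise ε, loss L, and admissible noise set E) are illustrative assumptions, not the paper's notation or its exact loss.

```latex
\min_{\theta} \; \mathbb{E}_{(x,\, m,\, y)}
\Big[ \max_{\epsilon \in \mathcal{E}} \;
\mathcal{L}\big( f_{\theta}(x, m, \epsilon),\; y \big) \Big]
```

Under this reading, the inner maximum picks the noise that makes the current model fail most badly on a given sample, and the outer minimum updates θ so that even this noise yields a result close to the curated target.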
Problem

Research questions and friction points this paper is trying to address.

Addresses hallucinated objects and visual artifacts in video object removal
Reduces the computational cost and inference latency of existing methods
Improves video object removal quality and efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simplified model by removing text input and cross-attention layers
Two-stage pipeline with minimax optimization in the second stage
Achieves fast inference without CFG
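To make the minimax idea concrete, here is a minimal toy sketch in NumPy. It is not the authors' implementation: a two-parameter linear model stands in for the video diffusion remover, the noise is constrained to a fixed-norm sphere (an assumption made here so the inner maximization is well posed), and every function name is hypothetical. The inner loop searches for "bad noise" by projected gradient ascent; the outer loop takes a gradient descent step on the parameters at that noise.

```python
import numpy as np


def loss_and_grads(theta, x, y, eps):
    """Mean squared error of a toy linear 'remover' plus its gradients.

    Prediction is w * x + v * eps with theta = (w, v); this stands in for
    a generator that takes a conditioning input x and sampling noise eps.
    """
    w, v = theta
    resid = w * x + v * eps - y
    n = x.size
    loss = float(np.mean(resid ** 2))
    grad_theta = np.array([2.0 * np.sum(resid * x) / n,
                           2.0 * np.sum(resid * eps) / n])
    grad_eps = 2.0 * v * resid / n
    return loss, grad_theta, grad_eps


def project_to_sphere(eps, radius):
    """Rescale eps to a fixed norm so the noise cannot grow without bound."""
    norm = np.linalg.norm(eps)
    return eps if norm == 0.0 else eps * (radius / norm)


def find_bad_noise(theta, x, y, eps0, radius, steps=10, step_size=0.5):
    """Inner maximization: projected gradient ascent on the noise."""
    eps = project_to_sphere(eps0, radius)
    for _ in range(steps):
        _, _, g = loss_and_grads(theta, x, y, eps)
        direction = g / (np.linalg.norm(g) + 1e-12)
        eps = project_to_sphere(eps + step_size * radius * direction, radius)
    return eps


def train_minimax(x, y, theta0, outer_steps=200, lr=0.1, seed=0):
    """Outer minimization: gradient descent on theta at the bad noise."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    radius = np.sqrt(x.size)  # typical norm of a standard Gaussian draw
    for _ in range(outer_steps):
        eps0 = rng.standard_normal(x.size)
        eps = find_bad_noise(theta, x, y, eps0, radius)
        _, grad_theta, _ = loss_and_grads(theta, x, y, eps)
        theta -= lr * grad_theta
    return theta


if __name__ == "__main__":
    x = np.linspace(-1.0, 1.0, 16)
    # Toy target: the clean output equals the input, independent of noise.
    theta = train_minimax(x, x.copy(), theta0=[0.5, 0.8])
    print("learned (w, v):", theta)
```

In this toy setting the target does not depend on the noise, so training against the worst-case noise pushes the noise weight v toward zero. This mirrors, very loosely, the stated goal of producing good removals regardless of which input noise is sampled.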