EasyOmnimatte: Taming Pretrained Inpainting Diffusion Models for End-to-End Video Layered Decomposition

📅 2025-12-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video omnimatte methods rely heavily on multi-stage optimization or iterative inference at runtime, forcing a trade-off between efficiency and quality while underutilizing generative priors. To address this, we propose an end-to-end video layered decomposition framework that jointly separates the foreground and its associated visual effects (e.g., shadows, reflections) into an alpha matte and an effect layer. We introduce a novel dual-expert collaborative fine-tuning architecture: the Effect Expert—guided by DiT sensitivity analysis—models coarse-grained effects, while the Quality Expert refines the alpha matte via full-module LoRA adaptation. Both experts operate synergistically within a single diffusion sampling pass. Our method builds upon a pretrained video inpainting diffusion model, incorporating modular LoRA fine-tuning and staged noise scheduling. It achieves new state-of-the-art performance in both accuracy and speed for video omnimatte, enabling real-time post-processing and supporting diverse downstream visual editing tasks.

📝 Abstract
Existing video omnimatte methods typically rely on slow, multi-stage, or inference-time optimization pipelines that fail to fully exploit powerful generative priors, producing suboptimal decompositions. Our key insight is that, if a video inpainting model can be finetuned to remove foreground-associated effects, then it must be inherently capable of perceiving these effects, and hence can also be finetuned for the complementary task: foreground layer decomposition with associated effects. However, although naïvely finetuning the inpainting model with LoRA applied to all blocks can produce high-quality alpha mattes, it fails to capture associated effects. Our systematic analysis reveals that this arises because effect-related cues are primarily encoded in specific DiT blocks and become suppressed when LoRA is applied across all blocks. To address this, we introduce EasyOmnimatte, the first unified, end-to-end video omnimatte method. Concretely, we finetune a pretrained video inpainting diffusion model to learn two complementary experts while keeping its original weights intact: an Effect Expert, where LoRA is applied only to effect-sensitive DiT blocks to capture the coarse structure of the foreground and associated effects, and a Quality Expert, fully LoRA-finetuned to refine the alpha matte. During sampling, the Effect Expert denoises at early, high-noise steps, while the Quality Expert takes over at later, low-noise steps. This design eliminates the need for two full diffusion passes, significantly reducing computational cost without compromising output quality. Ablation studies validate the effectiveness of this Dual-Expert strategy. Experiments demonstrate that EasyOmnimatte sets a new state-of-the-art for video omnimatte and enables various downstream tasks, significantly outperforming baselines in both quality and efficiency.
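The staged sampling described in the abstract—Effect Expert at early, high-noise steps, Quality Expert at later, low-noise steps, within a single diffusion pass—can be sketched as an expert schedule over the denoising trajectory. This is an illustrative sketch only: the function names, the switch point, and the placeholder denoising step are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the Dual-Expert sampling schedule: one denoising
# trajectory in which the active LoRA expert is swapped partway through,
# so no second full diffusion pass is needed. All names are illustrative.

def dual_expert_schedule(num_steps, switch_step):
    """Return which expert's LoRA adapters are active at each denoising step.

    Steps run from high noise (step 0) to low noise (step num_steps - 1).
    """
    schedule = []
    for step in range(num_steps):
        if step < switch_step:
            # Early, high-noise steps: coarse foreground + associated effects.
            schedule.append("effect_expert")
        else:
            # Later, low-noise steps: alpha-matte refinement.
            schedule.append("quality_expert")
    return schedule

def denoise_step(latents, step, expert):
    # Placeholder for the base inpainting DiT forward pass with the given
    # expert's LoRA adapters activated; the real model is not reproduced here.
    return latents

def sample(latents, num_steps=10, switch_step=4):
    """Single sampling pass that swaps LoRA experts mid-trajectory."""
    for step, expert in enumerate(dual_expert_schedule(num_steps, switch_step)):
        latents = denoise_step(latents, step, expert)
    return latents
```

Because the base model's weights stay frozen and only the LoRA adapters differ, switching experts between steps is cheap compared to running two separate diffusion passes.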
Problem

Research questions and friction points this paper is trying to address.

Develops end-to-end video layered decomposition method
Captures foreground layers and associated effects efficiently
Reduces computational cost while maintaining output quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes video inpainting diffusion model for dual tasks
Uses dual expert strategy with selective LoRA application
Reduces computational cost with a single diffusion pass
Yihan Hu
GVC Lab, Great Bay University
Xuelin Chen
Adobe Research
Xiaodong Cun
GVC Lab, Great Bay University
Computational Photography, Computer Vision, Computer Graphics