🤖 AI Summary
Existing video omnimatte methods rely heavily on multi-stage optimization or iterative inference at runtime, forcing a trade-off between efficiency and quality while underutilizing generative priors. To address this, we propose an end-to-end hierarchical video decomposition framework that jointly separates the foreground and its associated visual effects (e.g., shadows, reflections) into an alpha matte and an effect layer. We introduce a novel dual-expert collaborative fine-tuning architecture: the Effect Expert, guided by DiT sensitivity analysis, models coarse-grained effects, while the Quality Expert refines the alpha matte via full-module LoRA adaptation. Both experts operate synergistically within a single diffusion sampling pass. Our method builds upon a pretrained video inpainting diffusion model, incorporating modular LoRA fine-tuning and staged noise scheduling. It achieves new state-of-the-art performance in both accuracy and speed for video omnimatte, enabling real-time post-processing and supporting diverse downstream visual editing tasks.
📝 Abstract
Existing video omnimatte methods typically rely on slow, multi-stage, or inference-time optimization pipelines that fail to fully exploit powerful generative priors, producing suboptimal decompositions. Our key insight is that if a video inpainting model can be finetuned to remove foreground-associated effects, it must be inherently capable of perceiving these effects, and hence can also be finetuned for the complementary task: foreground layer decomposition with associated effects. However, although naïvely finetuning the inpainting model with LoRA applied to all blocks can produce high-quality alpha mattes, it fails to capture the associated effects. Our systematic analysis reveals that this arises because effect-related cues are primarily encoded in specific DiT blocks and become suppressed when LoRA is applied across all blocks. To address this, we introduce EasyOmnimatte, the first unified, end-to-end video omnimatte method. Concretely, we finetune a pretrained video inpainting diffusion model to learn two complementary experts while keeping its original weights intact: an Effect Expert, in which LoRA is applied only to effect-sensitive DiT blocks to capture the coarse structure of the foreground and its associated effects, and a Quality Expert, in which LoRA is applied to all blocks to refine the alpha matte. During sampling, the Effect Expert denoises the early, high-noise steps, while the Quality Expert takes over at the later, low-noise steps. This design eliminates the need for two full diffusion passes, significantly reducing computational cost without compromising output quality. Ablation studies validate the effectiveness of this dual-expert strategy. Experiments demonstrate that EasyOmnimatte sets a new state of the art for video omnimatte and enables various downstream tasks, significantly outperforming baselines in both quality and efficiency.
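The dual-expert schedule described above (Effect Expert for early, high-noise steps; Quality Expert for later, low-noise steps, all within one diffusion pass) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `switch_ratio` hyperparameter, and the adapter-switching mechanism are assumptions for exposition.

```python
# Hedged sketch of a dual-expert denoising schedule, assuming a single
# latent trajectory whose active LoRA adapter is swapped at a step threshold.
# `switch_ratio` (fraction of steps given to the Effect Expert) is a
# hypothetical hyperparameter, not a value from the paper.

def select_expert(step: int, total_steps: int, switch_ratio: float = 0.5) -> str:
    """Pick which LoRA expert denoises this step.

    Early (high-noise) steps -> "effect" (LoRA on effect-sensitive DiT
    blocks only); later (low-noise) steps -> "quality" (LoRA on all blocks).
    """
    return "effect" if step < int(total_steps * switch_ratio) else "quality"


def sample_schedule(total_steps: int = 10, switch_ratio: float = 0.5) -> list[str]:
    """Run one sampling pass and record which expert handled each step."""
    schedule = []
    for step in range(total_steps):
        expert = select_expert(step, total_steps, switch_ratio)
        # In a real pipeline, one DiT forward per step with the chosen
        # adapter active, e.g.:
        #   latent = dit(latent, timestep=step, active_lora=expert)
        schedule.append(expert)
    return schedule
```

Because both experts share the frozen base model and only the lightweight LoRA adapter is switched, the whole decomposition costs a single sampling pass rather than two.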