🤖 AI Summary
Existing video omnimatte methods rely heavily on multi-stage optimization or iterative inference at runtime, forcing a trade-off between efficiency and quality while underutilizing generative priors. To address this, we propose an end-to-end hierarchical video decomposition framework that jointly separates the foreground and its associated visual effects (e.g., shadows, reflections) into an alpha matte and an effect layer. We introduce a novel dual-expert collaborative fine-tuning architecture: the Effect Expert, guided by DiT sensitivity analysis, models coarse-grained effects, while the Quality Expert refines the alpha matte via full-module LoRA adaptation. Both experts operate synergistically within a single diffusion sampling pass. Our method builds upon a pretrained video inpainting diffusion model, incorporating modular LoRA fine-tuning and staged noise scheduling. It achieves new state-of-the-art performance in both accuracy and speed for video omnimatte, enabling real-time post-processing and supporting diverse downstream visual editing tasks.
📝 Abstract
Existing video omnimatte methods typically rely on slow, multi-stage, or inference-time optimization pipelines that fail to fully exploit powerful generative priors, producing suboptimal decompositions. Our key insight is that if a video inpainting model can be finetuned to remove foreground-associated effects, it must be inherently capable of perceiving these effects, and hence can also be finetuned for the complementary task: foreground layer decomposition with associated effects. However, although naïvely finetuning the inpainting model with LoRA applied to all blocks can produce high-quality alpha mattes, it fails to capture the associated effects. Our systematic analysis reveals that this arises because effect-related cues are primarily encoded in specific DiT blocks and become suppressed when LoRA is applied across all blocks. To address this, we introduce EasyOmnimatte, the first unified, end-to-end video omnimatte method. Concretely, we finetune a pretrained video inpainting diffusion model to learn two complementary experts while keeping its original weights intact: an Effect Expert, in which LoRA is applied only to effect-sensitive DiT blocks to capture the coarse structure of the foreground and its associated effects, and a Quality Expert, in which LoRA is applied to all blocks to refine the alpha matte. During sampling, the Effect Expert denoises the early, high-noise steps, while the Quality Expert takes over at the later, low-noise steps. This design eliminates the need for two full diffusion passes, significantly reducing computational cost without compromising output quality. Ablation studies validate the effectiveness of this dual-expert strategy. Experiments demonstrate that EasyOmnimatte sets a new state of the art for video omnimatte and enables various downstream tasks, significantly outperforming baselines in both quality and efficiency.
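The dual-expert schedule described above (Effect Expert for early, high-noise steps; Quality Expert for later, low-noise steps, all within one diffusion pass) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `switch_ratio` hyperparameter, and the adapter-switching mechanism are assumptions for exposition.

```python
# Hedged sketch of a dual-expert denoising schedule, assuming a single
# latent trajectory whose active LoRA adapter is swapped at a step threshold.
# `switch_ratio` (fraction of steps given to the Effect Expert) is a
# hypothetical hyperparameter, not a value from the paper.

def select_expert(step: int, total_steps: int, switch_ratio: float = 0.5) -> str:
    """Pick which LoRA expert denoises this step.

    Early (high-noise) steps -> "effect" (LoRA on effect-sensitive DiT
    blocks only); later (low-noise) steps -> "quality" (LoRA on all blocks).
    """
    return "effect" if step < int(total_steps * switch_ratio) else "quality"


def sample_schedule(total_steps: int = 10, switch_ratio: float = 0.5) -> list[str]:
    """Run one sampling pass and record which expert handled each step."""
    schedule = []
    for step in range(total_steps):
        expert = select_expert(step, total_steps, switch_ratio)
        # In a real pipeline, one DiT forward per step with the chosen
        # adapter active, e.g.:
        #   latent = dit(latent, timestep=step, active_lora=expert)
        schedule.append(expert)
    return schedule
```

Because both experts share the frozen base model and only the lightweight LoRA adapter is switched, the whole decomposition costs a single sampling pass rather than two.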