Generative Omnimatte: Learning to Decompose Video into Layers

📅 2024-11-25
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
🤖 AI Summary
Existing layered video decomposition methods rely on static-background assumptions or require accurate camera pose and depth estimation, which limits their applicability to dynamic scenes; moreover, they lack generative priors, so they produce implausible inpainting of dynamically occluded regions. This paper introduces a generative-prior-based layered video decomposition framework that decomposes casually captured videos into semantically coherent and visually complete layers—including soft shadows, specular reflections, splashes, and other natural effects—using only object masks as input, without requiring a static background, camera poses, or depth information. The approach fine-tunes a video diffusion model on a small, carefully annotated dataset with a mask-guided generative denoising mechanism, enabling photorealistic completion of dynamically occluded regions. Experiments demonstrate state-of-the-art performance in modeling complex visual effects and completing occluded content.

📝 Abstract
Given a video and a set of input object masks, an omnimatte method aims to decompose the video into semantically meaningful layers containing individual objects along with their associated effects, such as shadows and reflections. Existing omnimatte methods assume a static background or accurate pose and depth estimation and produce poor decompositions when these assumptions are violated. Furthermore, due to the lack of generative prior on natural videos, existing methods cannot complete dynamic occluded regions. We present a novel generative layered video decomposition framework to address the omnimatte problem. Our method does not assume a stationary scene or require camera pose or depth information and produces clean, complete layers, including convincing completions of occluded dynamic regions. Our core idea is to train a video diffusion model to identify and remove scene effects caused by a specific object. We show that this model can be finetuned from an existing video inpainting model with a small, carefully curated dataset, and demonstrate high-quality decompositions and editing results for a wide range of casually captured videos containing soft shadows, glossy reflections, splashing water, and more.
Problem

Research questions and friction points this paper is trying to address.

Decompose video into layers with objects and effects
Overcome limitations of static background assumptions
Complete dynamic occluded regions using generative prior
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative layered video decomposition without static assumptions
Video diffusion model for object effect removal
Fine-tuning from video inpainting with curated data
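The decomposition described above targets the standard omnimatte layer model: each object layer is an RGBA video (color plus opacity, carrying the object and its associated effects such as shadows), and the layers recompose into the original frame by back-to-front alpha-over compositing. The sketch below illustrates that recomposition step only; it is a generic illustration with hypothetical array shapes, not code from the paper, and the paper's diffusion-based layer estimation is not shown.

```python
import numpy as np

def composite_layers(background, layers):
    """Recompose one frame from ordered RGBA object layers via
    back-to-front alpha-over compositing (the standard omnimatte
    layer model). `background` is H x W x 3 in [0, 1]; each layer
    is H x W x 4 (RGB + alpha), ordered back to front."""
    out = background.astype(np.float32)
    for rgba in layers:
        rgb, alpha = rgba[..., :3], rgba[..., 3:4]
        # Alpha-over: layer color where opaque, underlying frame elsewhere.
        out = alpha * rgb + (1.0 - alpha) * out
    return out
```

A fully opaque layer replaces the background entirely; a soft-shadow layer would use fractional alpha so the background shows through darkened.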