🤖 AI Summary
This work addresses key challenges in hierarchical video generation—poor foreground transparency, unclean backgrounds, and strong inter-layer coupling—by proposing the first unified layer-aware video generation framework. Methodologically, it introduces layer embeddings and sub-clip organization to extend text-to-video diffusion Transformers into a hierarchical modeling architecture; it further proposes a novel co-training strategy combining a motion LoRA and a content LoRA, leveraging both static layered images and copy-pasted synthetic video data to transfer the high fidelity of static layered images to smooth video generation. Contributions include: (1) generating multi-layer videos with sharp visual quality, precise inter-layer separation, and temporally coherent motion—without requiring real layered video supervision; and (2) supporting diverse editing tasks—including foreground/background generation, mixed-scene synthesis, video decomposition, and layer completion—with state-of-the-art performance across multiple quantitative metrics.
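The sub-clip organization described above can be sketched in a few lines. This is a toy illustration with hypothetical names (the paper does not publish this API): each layer's video becomes a sub-clip in one token stream, and a per-layer embedding index tags every token so the transformer can associate each clip with its layer-wise prompt.

```python
# Hypothetical sketch of LayerFlow-style sub-clip organization.
# Frames are stand-in tokens; in the real model they would be
# latent patch tokens fed to a diffusion transformer.

LAYERS = ["foreground", "background", "blend"]  # fixed layer order

def organize_subclips(clips, prompts):
    """Flatten per-layer clips into one stream with layer ids.

    clips:   dict layer -> list of frame tokens
    prompts: dict layer -> layer-wise prompt string
    Returns (tokens, layer_ids, prompt_pairs); layer_ids[i] is the
    layer-embedding index for token i.
    """
    tokens, layer_ids, prompt_pairs = [], [], []
    for idx, layer in enumerate(LAYERS):
        clip = clips[layer]
        tokens.extend(clip)
        layer_ids.extend([idx] * len(clip))  # one id per whole sub-clip
        prompt_pairs.append((idx, prompts[layer]))
    return tokens, layer_ids, prompt_pairs

# toy example: two "frames" per layer
clips = {l: [f"{l}_frame{t}" for t in range(2)] for l in LAYERS}
prompts = {"foreground": "a cat", "background": "a park",
           "blend": "a cat in a park"}
tokens, layer_ids, pairs = organize_subclips(clips, prompts)
```

Because every variant (decomposition, background completion, etc.) is just a different choice of which sub-clips are given versus generated, this single layout supports them all in one framework.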
📝 Abstract
We present LayerFlow, a unified solution for layer-aware video generation. Given per-layer prompts, LayerFlow generates videos for the transparent foreground, clean background, and blended scene. It also supports versatile variants like decomposing a blended video or generating the background for a given foreground, and vice versa. Starting from a text-to-video diffusion transformer, we organize the videos for different layers as sub-clips, and leverage layer embeddings to distinguish each clip and its corresponding layer-wise prompt. In this way, we seamlessly support the aforementioned variants in one unified framework. Given the lack of high-quality layer-wise training videos, we design a multi-stage training strategy that accommodates static images with high-quality layer annotations. Specifically, we first train the model on low-quality video data. Then, we tune a motion LoRA to make the model compatible with static frames. Afterward, we train the content LoRA on a mixture of high-quality layered images and copy-pasted video data. During inference, we remove the motion LoRA, thus generating smooth videos with the desired layers.
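The staged LoRA recipe can be summarized as toggling low-rank adapters on a shared base weight. Below is a minimal plain-Python sketch under assumed names (the actual model, ranks, and scales are not specified here): each LoRA contributes a low-rank update scale * (B @ A); the motion LoRA is active while training on static frames, then simply excluded at inference so the content LoRA alone shapes generation.

```python
# Hypothetical sketch of LoRA merging/removal; a real implementation
# would apply this per attention/MLP weight of the transformer.

def matmul(X, Y):
    """Plain-Python matrix product for the toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def effective_weight(W, loras, active):
    """Return W + sum of scale * (B @ A) over the LoRAs in `active`."""
    out = [row[:] for row in W]
    for name, (A, B, scale) in loras.items():
        if name not in active:
            continue  # e.g. motion LoRA dropped at inference
        delta = matmul(B, A)  # low-rank update
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                out[i][j] += scale * v
    return out

# rank-1 toy LoRAs (A is 1x2, B is 2x1) on a 2x2 base weight
W = [[1.0, 0.0], [0.0, 1.0]]
loras = {
    "motion":  ([[1.0, 0.0]], [[1.0], [0.0]], 0.5),
    "content": ([[0.0, 1.0]], [[0.0], [1.0]], 0.5),
}
train_W = effective_weight(W, loras, {"motion", "content"})  # static-frame stage
infer_W = effective_weight(W, loras, {"content"})            # motion LoRA removed
```

The design choice this illustrates: because LoRA updates are additive and kept separate from the base weights, the motion adapter that made static images look like "frozen videos" during training can be deleted at inference without retraining anything.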