LayerFlow: A Unified Model for Layer-aware Video Generation

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in hierarchical video generation—poor foreground transparency, unclean backgrounds, and strong inter-layer coupling—by proposing the first unified layer-aware video generation framework. Methodologically, it introduces layer embeddings and sub-clip organization to extend text-to-video diffusion Transformers into a hierarchical modeling architecture; it further proposes a novel co-training strategy combining motion LoRA and content LoRA, leveraging both static layered images and synthetic video data to enable high-fidelity image-to-smooth-video transfer. Contributions include: (1) generating multi-layer videos with sharp visual quality, precise inter-layer separation, and temporally coherent motion—without requiring real layered video supervision; and (2) supporting diverse editing tasks—including foreground/background generation, mixed-scene synthesis, video decomposition, and layer completion—with state-of-the-art performance across multiple quantitative metrics.
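The core architectural idea above is that the three layers (transparent foreground, clean background, blended scene) are laid out as sub-clips in one token sequence, with a learned layer embedding added to each sub-clip so the transformer can tell them apart. The following is a minimal illustrative sketch of that tagging scheme, not the authors' code; the names (`LAYERS`, `layer_embedding`, `tag_sub_clip`, `build_sequence`) and the tiny embedding dimension are assumptions made for clarity.

```python
# Hedged sketch: how layer embeddings could distinguish the three
# sub-clips in a unified token sequence. Illustrative only.

LAYERS = ["foreground", "background", "blend"]

# Hypothetical learned per-layer embedding vectors (dim 4 for brevity;
# a real model would use the transformer's hidden dimension).
layer_embedding = {
    "foreground": [1.0, 0.0, 0.0, 0.0],
    "background": [0.0, 1.0, 0.0, 0.0],
    "blend":      [0.0, 0.0, 1.0, 0.0],
}

def tag_sub_clip(frame_tokens, layer):
    """Add the layer embedding to every token of one sub-clip."""
    emb = layer_embedding[layer]
    return [[t + e for t, e in zip(tok, emb)] for tok in frame_tokens]

def build_sequence(clips_by_layer):
    """Concatenate the tagged sub-clips into the single sequence a
    unified diffusion transformer would consume."""
    seq = []
    for layer in LAYERS:
        seq.extend(tag_sub_clip(clips_by_layer[layer], layer))
    return seq
```

Because every variant (generation, decomposition, completion) uses the same tagged layout, one model can serve them all by conditioning or denoising different sub-clips.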

📝 Abstract
We present LayerFlow, a unified solution for layer-aware video generation. Given per-layer prompts, LayerFlow generates videos for the transparent foreground, clean background, and blended scene. It also supports versatile variants such as decomposing a blended video or generating the background for a given foreground, and vice versa. Starting from a text-to-video diffusion transformer, we organize the videos for different layers as sub-clips and leverage layer embeddings to distinguish each clip and its corresponding layer-wise prompt. In this way, we seamlessly support the aforementioned variants in one unified framework. Given the lack of high-quality layer-wise training videos, we design a multi-stage training strategy to accommodate static images with high-quality layer annotations. Specifically, we first train the model with low-quality video data. Then, we tune a motion LoRA to make the model compatible with static frames. Afterward, we train the content LoRA on a mixture of high-quality layered images and copy-pasted video data. During inference, we remove the motion LoRA to generate smooth videos with the desired layers.
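The multi-stage strategy in the abstract can be summarized as a training schedule: which data each stage uses, which weights it updates, and which LoRAs stay active. The sketch below is illustrative only; the stage names, flags, and `inference_loras` helper are assumptions, not the authors' code.

```python
# Hedged sketch of the multi-stage training schedule described in the
# abstract. Stage names and field layout are illustrative assumptions.

STAGES = [
    {"name": "stage1_base",     # base model on abundant but noisy data
     "data": "low-quality layered videos",
     "train": ["base"],         "active_loras": []},
    {"name": "stage2_motion",   # adapter that absorbs the static-frame gap
     "data": "static frames",
     "train": ["motion_lora"],  "active_loras": ["motion_lora"]},
    {"name": "stage3_content",  # quality from layered images, motion intact
     "data": "high-quality layered images + copy-pasted videos",
     "train": ["content_lora"], "active_loras": ["motion_lora", "content_lora"]},
]

def inference_loras():
    # The motion LoRA is removed at inference so the model generates
    # smooth motion again, while the content LoRA keeps the layer
    # quality learned from static images.
    return ["content_lora"]
```

The design choice here is that the motion LoRA isolates "staticness" into a removable adapter, so training on still images does not permanently damage the model's motion prior.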
Problem

Research questions and friction points this paper is trying to address.

Generating layer-aware videos from per-layer prompts
Decomposing blended videos and recombining or swapping layers
Training despite the scarcity of high-quality layer-wise video data
Innovation

Methods, ideas, or system contributions that make the work stand out.

LayerFlow unifies layer-aware video generation
Uses layer embeddings for sub-clip distinction
Multi-stage training with static and video data
👥 Authors
Sihui Ji (The University of Hong Kong)
Hao Luo (DAMO Academy, Alibaba Group; Hupan Laboratory, China)
Xi Chen (The University of Hong Kong)
Yuanpeng Tu (The University of Hong Kong)
Yiyang Wang (Intel Corporation)
Hengshuang Zhao (The University of Hong Kong)