🤖 AI Summary
This work addresses key challenges in hierarchical video generation—poor foreground transparency, unclean backgrounds, and strong inter-layer coupling—by proposing the first unified layer-aware video generation framework. Methodologically, it introduces layer embeddings and sub-clip organization to extend text-to-video diffusion Transformers into a hierarchical modeling architecture; it further proposes a novel co-training strategy combining a motion LoRA and a content LoRA, leveraging both static layered images and copy-pasted synthetic video data to transfer the high fidelity of static layered images to smooth video generation. Contributions include: (1) generating multi-layer videos with sharp visual quality, precise inter-layer separation, and temporally coherent motion—without requiring real layered video supervision; and (2) supporting diverse editing tasks—including foreground/background generation, mixed-scene synthesis, video decomposition, and layer completion—with state-of-the-art performance across multiple quantitative metrics.
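The sub-clip organization described above can be sketched in a few lines. This is a toy illustration with hypothetical names (the paper does not publish this API): each layer's video becomes a sub-clip in one token stream, and a per-layer embedding index tags every token so the transformer can associate each clip with its layer-wise prompt.

```python
# Hypothetical sketch of LayerFlow-style sub-clip organization.
# Frames are stand-in tokens; in the real model they would be
# latent patch tokens fed to a diffusion transformer.

LAYERS = ["foreground", "background", "blend"]  # fixed layer order

def organize_subclips(clips, prompts):
    """Flatten per-layer clips into one stream with layer ids.

    clips:   dict layer -> list of frame tokens
    prompts: dict layer -> layer-wise prompt string
    Returns (tokens, layer_ids, prompt_pairs); layer_ids[i] is the
    layer-embedding index for token i.
    """
    tokens, layer_ids, prompt_pairs = [], [], []
    for idx, layer in enumerate(LAYERS):
        clip = clips[layer]
        tokens.extend(clip)
        layer_ids.extend([idx] * len(clip))  # one id per whole sub-clip
        prompt_pairs.append((idx, prompts[layer]))
    return tokens, layer_ids, prompt_pairs

# toy example: two "frames" per layer
clips = {l: [f"{l}_frame{t}" for t in range(2)] for l in LAYERS}
prompts = {"foreground": "a cat", "background": "a park",
           "blend": "a cat in a park"}
tokens, layer_ids, pairs = organize_subclips(clips, prompts)
```

Because every variant (decomposition, background completion, etc.) is just a different choice of which sub-clips are given versus generated, this single layout supports them all in one framework.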
📝 Abstract
We present LayerFlow, a unified solution for layer-aware video generation. Given per-layer prompts, LayerFlow generates videos for the transparent foreground, clean background, and blended scene. It also supports versatile variants like decomposing a blended video or generating the background for a given foreground, and vice versa. Starting from a text-to-video diffusion transformer, we organize the videos for different layers as sub-clips, and leverage layer embeddings to distinguish each clip and its corresponding layer-wise prompt. In this way, we seamlessly support the aforementioned variants in one unified framework. Given the lack of high-quality layer-wise training videos, we design a multi-stage training strategy that accommodates static images with high-quality layer annotations. Specifically, we first train the model on low-quality video data. Then, we tune a motion LoRA to make the model compatible with static frames. Afterward, we train the content LoRA on a mixture of high-quality layered images and copy-pasted video data. During inference, we remove the motion LoRA, thus generating smooth videos with the desired layers.
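The staged LoRA recipe can be summarized as toggling low-rank adapters on a shared base weight. Below is a minimal plain-Python sketch under assumed names (the actual model, ranks, and scales are not specified here): each LoRA contributes a low-rank update scale * (B @ A); the motion LoRA is active while training on static frames, then simply excluded at inference so the content LoRA alone shapes generation.

```python
# Hypothetical sketch of LoRA merging/removal; a real implementation
# would apply this per attention/MLP weight of the transformer.

def matmul(X, Y):
    """Plain-Python matrix product for the toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def effective_weight(W, loras, active):
    """Return W + sum of scale * (B @ A) over the LoRAs in `active`."""
    out = [row[:] for row in W]
    for name, (A, B, scale) in loras.items():
        if name not in active:
            continue  # e.g. motion LoRA dropped at inference
        delta = matmul(B, A)  # low-rank update
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                out[i][j] += scale * v
    return out

# rank-1 toy LoRAs (A is 1x2, B is 2x1) on a 2x2 base weight
W = [[1.0, 0.0], [0.0, 1.0]]
loras = {
    "motion":  ([[1.0, 0.0]], [[1.0], [0.0]], 0.5),
    "content": ([[0.0, 1.0]], [[0.0], [1.0]], 0.5),
}
train_W = effective_weight(W, loras, {"motion", "content"})  # static-frame stage
infer_W = effective_weight(W, loras, {"content"})            # motion LoRA removed
```

The design choice this illustrates: because LoRA updates are additive and kept separate from the base weights, the motion adapter that made static images look like "frozen videos" during training can be deleted at inference without retraining anything.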