Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing core challenges in video generation, namely modeling spatiotemporal dependencies, high computational cost, and limited controllability over motion dynamics, this paper introduces Multi-scale Next-DiT. Methodologically: (1) it proposes a multi-scale joint patchification mechanism that unifies spatiotemporal modeling across varying spatial resolutions and frame rates; (2) it incorporates motion scores as explicit conditional signals in the DiT backbone, enabling fine-grained control over the intensity of generated motion; and (3) it adopts a progressive training scheme with increasing resolution and FPS on mixed natural and synthetic data. The framework is further extended with Lumina-V2A, a Next-DiT-based video-to-audio model that adds synchronized sound to generated videos. Experiments demonstrate substantial improvements in visual fidelity and motion smoothness at high training and inference efficiency. The code is publicly available.
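
To make the multi-scale joint patchification concrete, here is a minimal PyTorch sketch of the idea: the same video latent is embedded at several (time, height, width) patch sizes, with coarser patches producing far fewer tokens for a shared backbone. All class and parameter names are illustrative assumptions, not the released Lumina-Video code.

```python
import torch
import torch.nn as nn

class MultiScalePatchify(nn.Module):
    """Embed a video latent at several (t, h, w) patch sizes.

    Coarser patches yield fewer tokens (cheaper attention); finer patches
    keep more detail. Every scale projects to the same token width, so a
    single DiT backbone can consume tokens from any scale.
    """

    def __init__(self, in_channels: int, dim: int,
                 patch_sizes=((1, 2, 2), (2, 4, 4))):
        super().__init__()
        self.embedders = nn.ModuleList(
            nn.Conv3d(in_channels, dim, kernel_size=p, stride=p)
            for p in patch_sizes
        )

    def forward(self, x: torch.Tensor, scale: int) -> torch.Tensor:
        # x: (B, C, T, H, W) video latent; pick one patchification scale.
        tokens = self.embedders[scale](x)          # (B, dim, T', H', W')
        return tokens.flatten(2).transpose(1, 2)   # (B, N, dim) token sequence

video = torch.randn(1, 16, 8, 32, 32)              # toy latent clip
patchify = MultiScalePatchify(in_channels=16, dim=256)
fine, coarse = patchify(video, 0), patchify(video, 1)
print(fine.shape, coarse.shape)                    # coarse scale: 8x fewer tokens
```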

📝 Abstract
Recent advancements have established Diffusion Transformers (DiTs) as a dominant framework in generative modeling. Building on this success, Lumina-Next achieves exceptional performance in the generation of photorealistic images with Next-DiT. However, its potential for video generation remains largely untapped, with significant challenges in modeling the spatiotemporal complexity inherent to video data. To address this, we introduce Lumina-Video, a framework that leverages the strengths of Next-DiT while introducing tailored solutions for video synthesis. Lumina-Video incorporates a Multi-scale Next-DiT architecture, which jointly learns multiple patchifications to enhance both efficiency and flexibility. By incorporating the motion score as an explicit condition, Lumina-Video also enables direct control of generated videos' dynamic degree. Combined with a progressive training scheme with increasingly higher resolution and FPS, and a multi-source training scheme with mixed natural and synthetic data, Lumina-Video achieves remarkable aesthetic quality and motion smoothness at high training and inference efficiency. We additionally propose Lumina-V2A, a video-to-audio model based on Next-DiT, to create synchronized sounds for generated videos. Codes are released at https://www.github.com/Alpha-VLLM/Lumina-Video.
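
The abstract's motion-score conditioning can be sketched in a few lines: treat the desired dynamic degree as a scalar condition, embed it the way DiT-style models embed the diffusion timestep, and add the two embeddings before adaLN modulation. This is a hedged illustration of the general DiT conditioning pattern; the exact fusion used by Lumina-Video may differ.

```python
import math
import torch
import torch.nn as nn

def freq_embedding(x: torch.Tensor, dim: int) -> torch.Tensor:
    # Sinusoidal embedding of a scalar condition, as used for timesteps in DiT.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) *
                      torch.arange(half, dtype=torch.float32) / half)
    args = x.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class MotionCondition(nn.Module):
    """Fuse a scalar motion score with the timestep embedding so the adaLN
    modulation in every transformer block sees the requested dynamic degree."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(),
                                 nn.Linear(dim, dim))

    def forward(self, t_emb: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        return t_emb + self.mlp(freq_embedding(motion, t_emb.shape[-1]))

cond = MotionCondition(dim=256)
t_emb = torch.randn(2, 256)                        # diffusion-step embedding
c = cond(t_emb, torch.tensor([2.0, 8.0]))          # low vs. high motion request
```

At inference, sampling the same prompt with different motion scores then lets the user trade stillness against dynamic degree directly.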
Problem

Research questions and friction points this paper is trying to address.

Enhance video generation efficiency
Model spatiotemporal video complexity
Control video dynamics explicitly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-scale Next-DiT architecture
Motion score as explicit condition
Progressive and multi-source training schemes (sketched below)
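
The last item is a curriculum rather than an architecture, so a small config sketch is enough to show its shape. The stage values below are illustrative assumptions, not the schedule reported in the paper.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    resolution: int           # spatial size of training clips (px)
    fps: int                  # temporal sampling rate
    sources: tuple[str, ...]  # data pools mixed at this stage

# Hypothetical progressive curriculum: train cheaply at low resolution and
# frame rate first, then raise both while blending natural and synthetic data.
SCHEDULE = [
    Stage(resolution=256,  fps=8,  sources=("natural",)),
    Stage(resolution=512,  fps=16, sources=("natural", "synthetic")),
    Stage(resolution=1024, fps=24, sources=("natural", "synthetic")),
]

for stage in SCHEDULE:
    print(f"train @ {stage.resolution}px, {stage.fps} fps, mix={stage.sources}")
```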
👥 Authors

Dongyang Liu · MMLab CUHK · Image/Video Generation, LLMs, VLMs
Shicheng Li · Shanghai AI Laboratory
Yutong Liu · Shanghai AI Laboratory
Zhen Li · The Chinese University of Hong Kong
Kai Wang · Shanghai AI Laboratory
Xinyue Li · Shanghai AI Laboratory
Qi Qin · Shanghai AI Laboratory
Yufei Liu · Shanghai AI Laboratory
Yi Xin · California Institute of Technology · Industrial Organization, Econometrics
Zhongyu Li · Shanghai AI Laboratory, Nankai University
Bin Fu · Shanghai AI Laboratory
Chenyang Si · Shanghai AI Laboratory
Yuewen Cao · The Chinese University of Hong Kong
Conghui He · Shanghai AI Laboratory · Data-centric AI, LLM, Document Intelligence
Ziwei Liu · Associate Professor, Nanyang Technological University · Computer Vision, Machine Learning, Computer Graphics
Yu Qiao · Shanghai AI Laboratory
Qibin Hou · Nankai University · Deep Learning, Computer Vision, Visual Attention
Hongsheng Li · The Chinese University of Hong Kong, Shanghai AI Laboratory
Peng Gao · Shanghai AI Laboratory