🤖 AI Summary
Addressing core challenges in video generation, namely the difficulty of modeling spatiotemporal dependencies, low computational efficiency, and limited control over motion dynamics, this paper introduces Lumina-Video, built on a Multi-Scale Next-DiT architecture. Methodologically: (1) it proposes a multi-scale joint patchification mechanism that unifies spatiotemporal modeling across varying spatial resolutions and frame rates; (2) it incorporates motion scores as explicit conditional signals in the DiT backbone, enabling fine-grained control over the motion intensity of generated videos; and (3) it adopts a progressive training scheme with increasing resolution and FPS over a multi-source mix of natural and synthetic data, and extends the framework to video-to-audio generation (Lumina-V2A). Experiments demonstrate substantial improvements in visual fidelity and motion smoothness at high training and inference efficiency. The code is publicly available.
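To make the multi-scale patchification idea concrete, here is a minimal NumPy sketch (not the paper's implementation; shapes, patch sizes, and the function name are illustrative assumptions). The same video can be split into coarse patches for a cheap, short token sequence or fine patches for a detailed, long one; a multi-scale model learns jointly over such patchifications.

```python
import numpy as np

def patchify_3d(video, pt, ph, pw):
    """Split a (T, H, W, C) video into flattened spatiotemporal patches.

    Returns an array of shape (num_patches, pt*ph*pw*C). Patch sizes must
    evenly divide the corresponding video dimensions.
    """
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # group the three patch axes together
    return x.reshape(-1, pt * ph * pw * C)

# One video, two patch scales: coarse patchification yields few tokens
# (cheap attention), fine patchification yields many (more detail).
video = np.random.rand(8, 32, 32, 4)        # (T, H, W, C), e.g. latent frames
coarse = patchify_3d(video, 2, 4, 4)        # 4*8*8  = 256 tokens of dim 128
fine = patchify_3d(video, 1, 2, 2)          # 8*16*16 = 2048 tokens of dim 16
```

The token count scales inversely with the patch volume, which is what makes the coarse scales efficient to train and the fine scales expressive.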
📝 Abstract
Recent advancements have established Diffusion Transformers (DiTs) as a dominant framework in generative modeling. Building on this success, Lumina-Next achieves exceptional performance in the generation of photorealistic images with Next-DiT. However, its potential for video generation remains largely untapped, with significant challenges in modeling the spatiotemporal complexity inherent to video data. To address this, we introduce Lumina-Video, a framework that leverages the strengths of Next-DiT while introducing tailored solutions for video synthesis. Lumina-Video incorporates a Multi-scale Next-DiT architecture, which jointly learns multiple patchifications to enhance both efficiency and flexibility. By incorporating the motion score as an explicit condition, Lumina-Video also enables direct control over the dynamic degree of generated videos. Combined with a progressive training scheme with progressively higher resolution and FPS, and a multi-source training scheme with mixed natural and synthetic data, Lumina-Video achieves remarkable aesthetic quality and motion smoothness at high training and inference efficiency. We additionally propose Lumina-V2A, a video-to-audio model based on Next-DiT, to create synchronized sounds for generated videos. Code is released at https://www.github.com/Alpha-VLLM/Lumina-Video.
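One way to picture the motion-score conditioning is via the standard sinusoidal embedding used for diffusion timesteps; a scalar motion score can be embedded the same way and combined with the timestep embedding before it modulates the transformer blocks. The sketch below is an illustrative assumption about the mechanism, not the paper's exact recipe; the function name and dimensions are hypothetical.

```python
import math

def sinusoidal_embedding(value, dim=8):
    """Standard sinusoidal embedding of a scalar; here the scalar may be
    a diffusion timestep or a motion score (e.g. in [0, 1])."""
    half = dim // 2
    emb = []
    for i in range(half):
        freq = 10000.0 ** (-i / half)   # geometric frequency schedule
        emb.append(math.sin(value * freq))
        emb.append(math.cos(value * freq))
    return emb

# Illustrative conditioning: sum the motion embedding with the timestep
# embedding, so the combined vector steers the DiT blocks. At inference,
# varying the motion score dials the dynamic degree of the output.
t_emb = sinusoidal_embedding(0.25)   # diffusion timestep
m_emb = sinusoidal_embedding(0.8)    # desired motion intensity
cond = [a + b for a, b in zip(t_emb, m_emb)]
```

Because the score is an explicit input rather than an implicit data statistic, the same prompt can be sampled at low or high motion intensity simply by changing this scalar.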