MAGI-1: Autoregressive Video Generation at Scale

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address poor temporal consistency, computational intractability, and low deployment efficiency in autoregressive long-video generation, this paper proposes a scalable block-wise autoregressive world model. It partitions videos into fixed-length frame chunks and introduces a temporally monotonic denoising mechanism that enforces causal modeling and enables streaming generation. The paper introduces three key innovations: (1) chunk-level monotonic noise scheduling, (2) chunk-wise prompt conditioning, and (3) constant-memory inference, which together overcome the bottleneck of long-range temporal modeling. The method integrates a large-scale diffusion architecture, MagiAttention sparse attention, chunked denoising training, and a custom distributed inference stack. The largest model contains 24 billion parameters and supports up to 4 million tokens of context. On text-conditioned image-to-video (I2V) generation, it achieves high-fidelity, temporally coherent, real-time synthesis, with peak GPU memory consumption independent of video length.

📝 Abstract
We present MAGI-1, a world model that generates videos by autoregressively predicting a sequence of video chunks, defined as fixed-length segments of consecutive frames. Trained to denoise per-chunk noise that increases monotonically over time, MAGI-1 enables causal temporal modeling and naturally supports streaming generation. It achieves strong performance on image-to-video (I2V) tasks conditioned on text instructions, providing high temporal consistency and scalability, which are made possible by several algorithmic innovations and a dedicated infrastructure stack. MAGI-1 facilitates controllable generation via chunk-wise prompting and supports real-time, memory-efficient deployment by maintaining constant peak inference cost, regardless of video length. The largest variant of MAGI-1 comprises 24 billion parameters and supports context lengths of up to 4 million tokens, demonstrating the scalability and robustness of our approach. The code and models are available at https://github.com/SandAI-org/MAGI-1 and https://github.com/SandAI-org/MagiAttention. The product can be accessed at https://sand.ai.
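The abstract's claim of constant peak inference cost follows from chunk-wise causality: only a bounded window of recent chunks needs to stay resident as conditioning context for the next chunk. The following is a minimal sketch of that idea; the window size, the `denoise_chunk` placeholder, and the cache policy are illustrative assumptions, not MAGI-1's actual implementation.

```python
from collections import deque

def denoise_chunk(idx, context):
    # Hypothetical placeholder: a real model would denoise latent frames
    # here, conditioned on the cached context chunks.
    return f"chunk-{idx}"

def stream_generate(num_chunks: int, context_window: int = 3) -> int:
    """Generate chunks autoregressively while keeping at most
    `context_window` past chunks in memory, so peak memory stays
    independent of the total video length. Returns the peak number
    of chunks resident at any step."""
    context = deque(maxlen=context_window)  # bounded cache of past chunks
    peak = 0
    for i in range(num_chunks):
        chunk = denoise_chunk(i, list(context))  # condition on bounded context
        context.append(chunk)                    # chunks beyond the window are evicted
        peak = max(peak, len(context))
    return peak
```

Because the deque evicts the oldest chunk once the window fills, `stream_generate(10)` and `stream_generate(10_000)` report the same peak, mirroring the paper's length-independent memory claim.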
Problem

Research questions and friction points this paper is trying to address.

Autoregressive video generation with temporal consistency
Scalable streaming generation for long videos
Controllable video synthesis via chunk-wise prompting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive video chunk prediction for generation
Denoising per-chunk noise for temporal modeling
Chunk-wise prompting for controllable generation
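The monotonic per-chunk noise idea above can be sketched as a staggered schedule: each chunk begins denoising a fixed number of steps after its predecessor, so at every global step later chunks carry at least as much noise as earlier ones. The `offset` and `steps` values and the linear shape are illustrative assumptions, not the paper's actual schedule.

```python
def chunk_noise(t: int, chunk_idx: int, offset: int = 4, steps: int = 16) -> float:
    """Noise level in [0, 1] of chunk `chunk_idx` at global step `t`.

    Chunk i starts denoising at step i * offset and takes `steps` steps
    to go from pure noise (1.0) to clean (0.0). Staggering the starts
    makes noise non-decreasing from earlier to later chunks at every
    step, which is the causal, temporally monotonic property.
    """
    progress = (t - chunk_idx * offset) / steps
    return float(min(1.0, max(0.0, 1.0 - progress)))
```

For example, at step 8 the schedule gives chunk 0 a noise level of 0.5, chunk 1 a level of 0.75, and chunk 2 still pure noise (1.0): earlier chunks are always cleaner, and a chunk that reaches 0.0 can be emitted for streaming playback while later chunks continue denoising.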
👥 Authors
Sand.ai
Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W. Q. Zhang, Weifeng Luo, Xiaoyang Kang, Yuchen Sun, Yue Cao, Yunpeng Huang, Yutong Lin, Yuxin Fang, Zewei Tao, Zheng Zhang, Zhongshu Wang, Zixun Liu, Dai Shi, Guoli Su, Hanwen Sun, Hong Pan, Jie Wang, Jiexin Sheng, Mingyan Cui, Min Hu, Ming Yan, Shucheng Yin, Siran Zhang, Tingting Liu, Xianping Yin, Xiaoyu Yang, Xin Song, Xuan Hu, Yankai Zhang, Yu-Qian Li