MAGI-1: Autoregressive Video Generation at Scale

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address poor temporal consistency, computational intractability, and low deployment efficiency in autoregressive long-video generation, this paper proposes a scalable block-wise autoregressive world model. It partitions videos into fixed-length frame chunks and introduces a temporally monotonic denoising mechanism that enforces causal modeling and enables streaming generation. The paper introduces three key innovations: (1) chunk-level monotonic noise scheduling, (2) chunk-wise prompt conditioning, and (3) constant-memory inference, which together overcome the bottleneck of long-range temporal modeling. The method integrates a large-scale diffusion architecture, MagiAttention sparse attention, chunked denoising training, and a custom distributed inference stack. The largest model contains 24 billion parameters and supports up to 4 million tokens of context. On text-conditioned image-to-video (I2V) generation, it achieves high-fidelity, temporally coherent, real-time synthesis, with peak GPU memory consumption independent of video length.

📝 Abstract
We present MAGI-1, a world model that generates videos by autoregressively predicting a sequence of video chunks, defined as fixed-length segments of consecutive frames. Trained to denoise per-chunk noise that increases monotonically over time, MAGI-1 enables causal temporal modeling and naturally supports streaming generation. It achieves strong performance on image-to-video (I2V) tasks conditioned on text instructions, providing high temporal consistency and scalability, which are made possible by several algorithmic innovations and a dedicated infrastructure stack. MAGI-1 facilitates controllable generation via chunk-wise prompting and supports real-time, memory-efficient deployment by maintaining constant peak inference cost, regardless of video length. The largest variant of MAGI-1 comprises 24 billion parameters and supports context lengths of up to 4 million tokens, demonstrating the scalability and robustness of our approach. The code and models are available at https://github.com/SandAI-org/MAGI-1 and https://github.com/SandAI-org/MagiAttention. The product can be accessed at https://sand.ai.
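The abstract's claim of constant peak inference cost follows from chunk-wise causality: only a bounded window of recent chunks needs to stay resident as conditioning context for the next chunk. The following is a minimal sketch of that idea; the window size, the `denoise_chunk` placeholder, and the cache policy are illustrative assumptions, not MAGI-1's actual implementation.

```python
from collections import deque

def denoise_chunk(idx, context):
    # Hypothetical placeholder: a real model would denoise latent frames
    # here, conditioned on the cached context chunks.
    return f"chunk-{idx}"

def stream_generate(num_chunks: int, context_window: int = 3) -> int:
    """Generate chunks autoregressively while keeping at most
    `context_window` past chunks in memory, so peak memory stays
    independent of the total video length. Returns the peak number
    of chunks resident at any step."""
    context = deque(maxlen=context_window)  # bounded cache of past chunks
    peak = 0
    for i in range(num_chunks):
        chunk = denoise_chunk(i, list(context))  # condition on bounded context
        context.append(chunk)                    # chunks beyond the window are evicted
        peak = max(peak, len(context))
    return peak
```

Because the deque evicts the oldest chunk once the window fills, `stream_generate(10)` and `stream_generate(10_000)` report the same peak, mirroring the paper's length-independent memory claim.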
Problem

Research questions and friction points this paper is trying to address.

Autoregressive video generation with temporal consistency
Scalable streaming generation for long videos
Controllable video synthesis via chunk-wise prompting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive video chunk prediction for generation
Denoising per-chunk noise for temporal modeling
Chunk-wise prompting for controllable generation
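The monotonic per-chunk noise idea above can be sketched as a staggered schedule: each chunk begins denoising a fixed number of steps after its predecessor, so at every global step later chunks carry at least as much noise as earlier ones. The `offset` and `steps` values and the linear shape are illustrative assumptions, not the paper's actual schedule.

```python
def chunk_noise(t: int, chunk_idx: int, offset: int = 4, steps: int = 16) -> float:
    """Noise level in [0, 1] of chunk `chunk_idx` at global step `t`.

    Chunk i starts denoising at step i * offset and takes `steps` steps
    to go from pure noise (1.0) to clean (0.0). Staggering the starts
    makes noise non-decreasing from earlier to later chunks at every
    step, which is the causal, temporally monotonic property.
    """
    progress = (t - chunk_idx * offset) / steps
    return float(min(1.0, max(0.0, 1.0 - progress)))
```

For example, at step 8 the schedule gives chunk 0 a noise level of 0.5, chunk 1 a level of 0.75, and chunk 2 still pure noise (1.0): earlier chunks are always cleaner, and a chunk that reaches 0.0 can be emitted for streaming playback while later chunks continue denoising.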
👥 Authors
Sand.ai
Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W. Q. Zhang, Weifeng Luo, Xiaoyang Kang, Yuchen Sun, Yue Cao, Yunpeng Huang, Yutong Lin, Yuxin Fang, Zewei Tao, Zheng Zhang, Zhongshu Wang, Zixun Liu, Dai Shi, Guoli Su, Hanwen Sun, Hong Pan, Jie Wang, Jiexin Sheng, Mingyan Cui, Min Hu, Ming Yan, Shucheng Yin, Siran Zhang, Tingting Liu, Xianping Yin, Xiaoyu Yang, Xin Song, Xuan Hu, Yankai Zhang, Yu-Qian Li