Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective

📅 2025-07-11
🤖 AI Summary
Existing autoregressive video generators face three bottlenecks: architectures that diverge from standard large language models (LLMs), reliance on bulky external text encoders, and high latency from next-token decoding. This paper proposes Lumos-1, an autoregressive video generator that keeps the LLM architecture with minimal modifications. It introduces MM-RoPE, a RoPE scheme that builds on 3D RoPE by providing comprehensive frequency spectra and scaled 3D positions for spatiotemporal data while preserving the original textual RoPE. It further proposes Autoregressive Discrete Diffusion Forcing (AR-DF), which builds on a token dependency strategy with intra-frame bidirectionality and inter-frame temporal causality. AR-DF applies temporal tube masking during training, paired with a compatible inference-time masking policy, to address the frame-wise loss imbalance caused by spatial information redundancy. Using memory-efficient training techniques, the model is pre-trained on only 48 GPUs and achieves performance comparable to EMU3 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V.
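The dependency pattern named in the summary (bidirectional within a frame, causal across frames) can be pictured as a block-wise attention mask. The following is a minimal sketch, not the authors' implementation: the function name `frame_causal_mask` is hypothetical, text tokens are left out, and every frame is assumed to hold the same number of tokens.

```python
def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> list[list[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumption for illustration: tokens are laid out frame by frame, and a
    token may see every token in its own frame and in all earlier frames.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame
        # Allowed when the key's frame is not later than the query's frame.
        row = [(k // tokens_per_frame) <= q_frame for k in range(total)]
        mask.append(row)
    return mask


mask = frame_causal_mask(num_frames=2, tokens_per_frame=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# prints:
# xx..
# xx..
# xxxx
# xxxx
```

Each diagonal block is fully filled (tokens in a frame see each other), while the region above the diagonal blocks is empty (no frame sees a later frame).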

📝 Abstract
Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive video generation. Existing autoregressive video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an autoregressive video generator that retains the LLM architecture with minimal architectural modifications. To inject spatiotemporal correlations in LLMs, we identify the efficacy of incorporating 3D RoPE and diagnose its imbalanced frequency spectrum ranges. Therefore, we propose MM-RoPE, a RoPE scheme that preserves the original textual RoPE while providing comprehensive frequency spectra and scaled 3D positions for modeling multimodal spatiotemporal data. Moreover, Lumos-1 resorts to a token dependency strategy that obeys intra-frame bidirectionality and inter-frame temporal causality. Based on this dependency strategy, we identify the issue of frame-wise loss imbalance caused by spatial information redundancy and solve it by proposing Autoregressive Discrete Diffusion Forcing (AR-DF). AR-DF introduces temporal tube masking during training with a compatible inference-time masking policy to avoid quality degradation. By using memory-efficient training techniques, we pre-train Lumos-1 on only 48 GPUs, achieving performance comparable to EMU3 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V. Code and models are available at https://github.com/alibaba-damo-academy/Lumos.
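To make "temporal tube masking" concrete, here is a minimal sketch of one common reading of the term: a single spatial mask is sampled and then repeated across all frames, so the same positions are hidden along the time axis. The function name `temporal_tube_mask`, the `mask_ratio` parameter, and the uniform sampling are assumptions for illustration; the abstract does not specify the paper's sampling scheme, its mask ratio, or its inference-time policy.

```python
import random


def temporal_tube_mask(num_frames, height, width, mask_ratio, seed=None):
    """Return mask[t][r][c] == True where the token at (t, r, c) is masked.

    Assumption for illustration: one spatial pattern is drawn uniformly at
    random and shared by every frame, forming "tubes" along the time axis.
    """
    rng = random.Random(seed)
    num_positions = height * width
    num_masked = round(mask_ratio * num_positions)
    masked = set(rng.sample(range(num_positions), num_masked))
    spatial = [
        [(r * width + c) in masked for c in range(width)]
        for r in range(height)
    ]
    # Copy the same spatial pattern into each frame.
    return [[row[:] for row in spatial] for _ in range(num_frames)]


tube = temporal_tube_mask(num_frames=3, height=2, width=4, mask_ratio=0.5, seed=0)
for t, frame in enumerate(tube):
    print(f"frame {t}:", ["".join("#" if m else "." for m in row) for row in frame])
```

Because every frame hides the same positions, a masked token cannot be recovered by simply copying the same location from a neighbouring frame, which is the intuition behind using tubes to counter spatial redundancy.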
Problem

Research questions and friction points this paper is trying to address.

Unify autoregressive video generation with standard LLM architecture
Address imbalanced frequency spectrum in spatiotemporal modeling
Solve frame-wise loss imbalance from spatial redundancy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retains LLM architecture with minimal modifications
Uses MM-RoPE for multimodal spatiotemporal data
Employs AR-DF to balance frame-wise loss
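For orientation, the sketch below shows a plain 3D RoPE of the kind MM-RoPE starts from: the channel pairs of a vector are split into three equal groups, rotated by the temporal, height, and width positions respectively. This is a generic illustration, not MM-RoPE itself. The abstract says MM-RoPE changes the frequency allocation and scales the 3D positions, and neither is reproduced here. The names `rope_3d`, `rope_angles`, and `rotate_pairs`, the equal three-way split, and the base of 10000 are assumptions.

```python
import math


def rope_angles(position, num_pairs, base=10000.0):
    """Standard RoPE angles: position * base**(-i / num_pairs) for each pair i."""
    return [position * base ** (-i / num_pairs) for i in range(num_pairs)]


def rotate_pairs(vec, angles):
    """Rotate consecutive channel pairs (vec[2i], vec[2i+1]) by angles[i]."""
    out = []
    for i, angle in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out


def rope_3d(vec, t, h, w, base=10000.0):
    """Apply a generic 3D RoPE: one third of the pairs per axis (t, h, w)."""
    assert len(vec) % 6 == 0, "need an even number of channels per axis"
    pairs_per_axis = len(vec) // 6
    angles = []
    for position in (t, h, w):
        angles += rope_angles(position, pairs_per_axis, base)
    return rotate_pairs(vec, angles)


query = [0.3, -1.2, 0.5, 2.0, -0.7, 0.1]
rotated = rope_3d(query, t=1, h=2, w=3)
print(rotated)
```

As with 1D RoPE, the dot product between a rotated query and a rotated key depends only on the position offsets along each axis, which is what lets attention scores encode relative spatiotemporal distance.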
Authors

Hangjie Yuan
Alibaba DAMO | ZJU | MMLab@NTU
Generative Models, Multimodal Models, Foundation Models, Video Understanding

Weihua Chen
Alibaba DAMO Academy, previously NLPR, CASIA
Computer Vision

Jun Cen
DAMO Academy, Alibaba Group

Hu Yu
DAMO Academy, Alibaba Group

Jingyun Liang
ETH Zurich
Image/Video Restoration, Low-Level Vision, Video Generation

Shuning Chang
DAMO Academy, Alibaba Group

Zhihui Lin
Tsinghua University, China
Machine Learning, Deep Learning, Video Generation, Segmentation

Tao Feng
Tsinghua University

Pengwei Liu
DAMO Academy, Alibaba Group

Jiazheng Xing
Zhejiang University
Generative AI, Video Understanding, Representation Learning

Hao Luo
DAMO Academy, Alibaba Group

Jiasheng Tang
DAMO Academy, Alibaba Group

Fan Wang
DAMO Academy, Alibaba Group

Yi Yang
Zhejiang University