Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective

📅 2025-07-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing autoregressive video generators face three bottlenecks: architectural divergence from large language models (LLMs), reliance on bulky external text encoders, and high latency from next-token decoding. This paper proposes Lumos-1, a video generator that keeps the LLM architecture with minimal modifications. It introduces MM-RoPE, a positional encoding that preserves the original textual RoPE while supplying comprehensive frequency spectra and scaled 3D positions for spatiotemporal data. Tokens follow intra-frame bidirectional and inter-frame causal dependencies. To counter the frame-wise loss imbalance that spatial information redundancy causes under this scheme, the paper proposes Autoregressive Discrete Diffusion Forcing (AR-DF), which applies temporal tube masking during training together with a compatible inference-time masking policy. With memory-efficient training techniques, Lumos-1 is pre-trained on only 48 GPUs and performs comparably to EMU3 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V.

📝 Abstract

Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive video generation. Existing autoregressive video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an autoregressive video generator that retains the LLM architecture with minimal architectural modifications. To inject spatiotemporal correlations in LLMs, we identify the efficacy of incorporating 3D RoPE and diagnose its imbalanced frequency spectrum ranges. Therefore, we propose MM-RoPE, a RoPE scheme that preserves the original textual RoPE while providing comprehensive frequency spectra and scaled 3D positions for modeling multimodal spatiotemporal data. Moreover, Lumos-1 resorts to a token dependency strategy that obeys intra-frame bidirectionality and inter-frame temporal causality. Based on this dependency strategy, we identify the issue of frame-wise loss imbalance caused by spatial information redundancy and solve it by proposing Autoregressive Discrete Diffusion Forcing (AR-DF). AR-DF introduces temporal tube masking during training with a compatible inference-time masking policy to avoid quality degradation. By using memory-efficient training techniques, we pre-train Lumos-1 on only 48 GPUs, achieving performance comparable to EMU3 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V. Code and models are available at https://github.com/alibaba-damo-academy/Lumos.
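
The dependency strategy described in the abstract (bidirectional within a frame, causal across frames) can be pictured as an attention mask. The following is a minimal sketch under stated assumptions, not the authors' implementation: it covers video tokens only, assumes every frame has the same number of tokens, and the function name `frame_causal_mask` is hypothetical.

```python
def frame_causal_mask(num_frames, tokens_per_frame):
    """Return mask[i][j] == True when query token i may attend to key token j.

    Assumption: tokens are laid out frame by frame, so token index // tokens_per_frame
    gives the frame a token belongs to.
    """
    total = num_frames * tokens_per_frame
    frame_of = [idx // tokens_per_frame for idx in range(total)]
    # A token sees every token in its own frame (bidirectional) and every
    # token in earlier frames (temporal causality), but nothing from later frames.
    return [[frame_of[j] <= frame_of[i] for j in range(total)] for i in range(total)]


mask = frame_causal_mask(num_frames=2, tokens_per_frame=2)
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

In the printed grid, the first two rows (frame 0) see only frame 0, while the last two rows (frame 1) see both frames, which is the block-lower-triangular pattern the dependency strategy implies.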

Problem

Research questions and friction points this paper is trying to address.

Unify autoregressive video generation with the standard LLM architecture

Address imbalanced frequency spectrum in spatiotemporal modeling

Solve frame-wise loss imbalance from spatial redundancy
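
One way to see what an imbalanced frequency spectrum can look like is to compute standard RoPE frequencies and hand out contiguous channel chunks to the temporal, height, and width axes. This is a hedged illustration only: the contiguous split, the head dimension of 48, the base of 10000, and the helper names are assumptions made for the example, not details taken from the paper.

```python
def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE frequencies: base^(-2i/d) for each of the d/2 channel pairs."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def contiguous_axis_split(freqs, num_axes=3):
    """Assumed naive 3D scheme: give each axis one contiguous chunk of frequencies.

    Any remainder channels are dropped when len(freqs) is not divisible by num_axes.
    """
    chunk = len(freqs) // num_axes
    return [freqs[a * chunk:(a + 1) * chunk] for a in range(num_axes)]


freqs = rope_frequencies(head_dim=48)  # 24 frequencies, 8 per axis
temporal, height, width = contiguous_axis_split(freqs)
for name, band in (("temporal", temporal), ("height", height), ("width", width)):
    print(f"{name:8s} highest={max(band):.6f} lowest={min(band):.6f}")
```

Under this split the three bands do not overlap: the first axis receives only the highest frequencies and the last axis only the lowest, so no axis covers the full spectrum.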

Innovation

Methods, ideas, or system contributions that make the work stand out.

Retains the LLM architecture with minimal modifications

Uses MM-RoPE for multimodal spatiotemporal data

Employs AR-DF to balance frame-wise loss
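
Temporal tube masking, which AR-DF uses during training, can be sketched as sampling one spatial mask and repeating it along the time axis so the same positions are hidden in every frame. The sketch below assumes independent per-position sampling with a fixed `mask_ratio`; the function name, parameters, and sampling rule are hypothetical and are not drawn from the Lumos-1 code.

```python
import random


def temporal_tube_mask(num_frames, height, width, mask_ratio, seed=None):
    """Return mask[t][y][x] == True where a token is masked.

    Assumption: one spatial pattern is sampled once and copied to every frame,
    so a masked position cannot be recovered by copying it from another frame.
    """
    rng = random.Random(seed)
    spatial = [[rng.random() < mask_ratio for _ in range(width)] for _ in range(height)]
    # Copy each row so frames do not share list objects.
    return [[row[:] for row in spatial] for _ in range(num_frames)]


mask = temporal_tube_mask(num_frames=3, height=2, width=4, mask_ratio=0.5, seed=0)
for t, frame in enumerate(mask):
    print(t, ["".join("#" if m else "." for m in row) for row in frame])
```

Because every frame shares the pattern, the masked positions form tubes through time, which is the property that counters the spatial redundancy between neighboring frames.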

🔎 Similar Papers

Pyramidal Flow Matching for Efficient Video Generative Modeling