VideoMAR: Autoregressive Video Generation with Continuous Tokens

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing autoregressive video generation methods suffer from inefficiency and limited fidelity. This paper introduces VideoMAR—the first decoder-only, continuous-token, image-to-video autoregressive model that jointly models temporal causality and spatial bidirectionality. Key innovations include a next-frame diffusion loss, temporal short-to-long curriculum learning, spatial progressive-resolution training, and 3D rotational positional encoding—collectively enabling efficient spatiotemporal extrapolation. Technically, VideoMAR integrates KV cache sharing, spatially parallel decoding, and progressive temperature sampling. On VBench-I2V, VideoMAR surpasses Cosmos I2V with only 9.3% of its parameters, 0.5% of its training data, and 0.2% of its GPU resources, achieving state-of-the-art fidelity and efficiency in image-conditioned video synthesis.
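The 3D rotational positional encoding mentioned above can be pictured as splitting each token's channel dimension into three equal chunks and applying a standard 1D rotary rotation per axis (time, height, width), which is what lets positions extrapolate beyond the training grid. The chunking scheme and base frequency below are illustrative assumptions, not the paper's exact implementation:

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Apply a 1-D rotary embedding to an even-length vector at integer position `pos`."""
    out = []
    d = len(vec)
    for i in range(d // 2):
        theta = pos / (base ** (2 * i / d))      # per-pair rotation frequency
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2-D rotation of each pair
    return out

def rope_3d(vec, t, h, w):
    """Sketch of 3D RoPE: rotate one third of the channels by each spatiotemporal
    axis. Equal thirds are an assumption; the split ratio is a design choice."""
    d = len(vec) // 3
    return (rope_1d(vec[:d], t)
            + rope_1d(vec[d:2 * d], h)
            + rope_1d(vec[2 * d:], w))
```

Because each chunk is a pure rotation, the encoding preserves vector norms and depends only on relative offsets inside attention, which is the property that enables spatial and temporal extrapolation.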

📝 Abstract
Mask-based autoregressive models have demonstrated promising image generation capability in continuous space. However, their potential for video generation remains under-explored. In this paper, we propose VideoMAR, a concise and efficient decoder-only autoregressive image-to-video model with continuous tokens, combining temporal frame-by-frame generation with spatial masked generation. We first identify temporal causality and spatial bi-directionality as the first principle of video AR models, and propose the next-frame diffusion loss to integrate masked and video generation. Besides, the huge cost and difficulty of long-sequence autoregressive modeling is a basic but crucial issue. To this end, we propose temporal short-to-long curriculum learning and spatial progressive-resolution training, and employ a progressive temperature strategy at inference time to mitigate accumulation error. Furthermore, VideoMAR transfers several distinctive capabilities of language models to video generation. It achieves high efficiency through simultaneous temporal KV caching and spatial parallel generation, and supports spatial and temporal extrapolation via 3D rotary embeddings. On the VBench-I2V benchmark, VideoMAR surpasses the previous state of the art (Cosmos I2V) while requiring significantly fewer parameters (9.3%), training data (0.5%), and GPU resources (0.2%).
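The progressive temperature strategy described in the abstract counters error accumulation during long rollouts by sampling more conservatively as frames get further from the conditioning image. A minimal sketch, assuming a simple linear anneal (the schedule shape and endpoint values are illustrative assumptions, not the paper's reported settings):

```python
def progressive_temperature(frame_idx, num_frames, t_start=1.0, t_end=0.7):
    """Linearly anneal the sampling temperature across generated frames.

    Early frames sample at t_start for diversity; later frames cool toward
    t_end so small per-frame errors compound less. Endpoints are assumptions.
    """
    if num_frames <= 1:
        return t_start
    alpha = frame_idx / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
    return t_start + alpha * (t_end - t_start)
```

In a frame-by-frame loop, the returned value would scale the sampler's noise or logits for that frame; any monotone decreasing schedule serves the same purpose.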
Problem

Research questions and friction points this paper is trying to address.

Exploring autoregressive models for video generation in continuous space
Addressing high cost of long sequence autoregressive modeling
Enhancing efficiency and extrapolation in video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous token autoregressive video generation
Temporal short-to-long curriculum learning
3D rotary embeddings for extrapolation
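The short-to-long curriculum listed above trains first on short clips and only later on full-length sequences, which keeps early training cheap and stabilizes long-sequence modeling. A sketch of one way to schedule it, with evenly spaced stage boundaries and clip lengths chosen purely for illustration:

```python
def curriculum_num_frames(step, total_steps, stages=(4, 8, 16)):
    """Temporal short-to-long curriculum: pick the clip length for a training step.

    Training time is split into len(stages) equal phases; each phase doubles
    the clip length. Both the boundaries and the lengths are assumptions.
    """
    stage = min(int(step / total_steps * len(stages)), len(stages) - 1)
    return stages[stage]
```

The same progressive idea applies spatially (low resolution first, then higher), so a full schedule would pair this with a resolution ramp.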
👥 Authors
Hu Yu
University of Science and Technology of China
Biao Gong
Ant Group | Alibaba Group
Generative Model · Retrieval · 3D Vision
Hangjie Yuan
Alibaba DAMO | ZJU | MMLab@NTU
Generative Models · Multimodal Models · Foundation Models · Video Understanding
DanDan Zheng
University of Science and Technology of China
Weilong Chai
University of Science and Technology of China
Jingdong Chen
University of Science and Technology of China
Kecheng Zheng
University of Science and Technology of China
Feng Zhao
University of Science and Technology of China