🤖 AI Summary
Existing autoregressive video generation models adopt the individual frame as the fundamental prediction unit, yet this choice lacks rigorous empirical validation and leads to poor temporal coherence and low inference efficiency. To address this, we propose VideoAR, a unified autoregressive framework supporting multi-granularity spatiotemporal prediction units. Its core innovation is to replace frame-wise modeling with learnable spatiotemporal cubes as the basic prediction unit, jointly modeling spatial and temporal dimensions through multi-scale refinement, key-frame-guided detail preservation, and flexible sequence decomposition. On the VBench benchmark, VideoAR outperforms state-of-the-art methods in visual quality and inference speed, and scales to efficient generation of minute-long videos.
📝 Abstract
Autoregressive models for video generation typically operate frame-by-frame, extending next-token prediction from language to video's temporal dimension. But whereas the word is universally accepted as the token in language, it is unclear whether the frame is the appropriate prediction unit for video. To address this, we present VideoAR, a unified framework that supports a spectrum of prediction units, including full frames, key-detail frames, multiscale refinements, and spatiotemporal cubes. Among these designs, we find it best to model video generation with *spatiotemporal* cubes as prediction units, which allows autoregressive models to operate across both spatial and temporal dimensions simultaneously. This approach eliminates the assumption that frames are the natural atomic units for video autoregression. We evaluate VideoAR across diverse prediction strategies, finding that cube-based prediction consistently delivers superior quality, speed, and temporal coherence. By removing the frame-by-frame constraint, our video generator surpasses state-of-the-art baselines on VBench while achieving faster inference and scaling seamlessly to minute-long sequences. We hope this work motivates rethinking sequence decomposition in video and other spatiotemporal domains.
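To make the contrast between frame-wise and cube-wise prediction units concrete, here is a minimal NumPy sketch of partitioning a video tensor into non-overlapping spatiotemporal cubes. This is an illustrative assumption about the decomposition, not the paper's actual implementation; the function name, cube sizes, and divisibility requirement are hypothetical simplifications.

```python
import numpy as np

def video_to_cubes(video, cube_t, cube_h, cube_w):
    """Split a (T, H, W, C) video into non-overlapping spatiotemporal cubes.

    Returns an array of shape (N, cube_t, cube_h, cube_w, C); each of the N
    cubes would serve as one autoregressive prediction unit. For simplicity
    we assume each dimension is divisible by the corresponding cube size.
    """
    T, H, W, C = video.shape
    assert T % cube_t == 0 and H % cube_h == 0 and W % cube_w == 0
    v = video.reshape(T // cube_t, cube_t,
                      H // cube_h, cube_h,
                      W // cube_w, cube_w, C)
    # Move the three block indices to the front, then flatten them into one axis.
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)
    return v.reshape(-1, cube_t, cube_h, cube_w, C)

video = np.zeros((16, 32, 32, 3))
cubes = video_to_cubes(video, cube_t=4, cube_h=8, cube_w=8)
print(cubes.shape)  # (64, 4, 8, 8, 3): 64 cube-shaped prediction units
```

Frame-by-frame prediction is recovered as the special case `cube_t=1, cube_h=H, cube_w=W`, which yields one unit per frame; cube-based decomposition instead lets the model attend over smaller units spanning both space and time.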