🤖 AI Summary
Autoregressive Transformer-based video generation suffers from a severe inference bottleneck: sequential, token-by-token decoding over extremely long sequences (tens of thousands of tokens). To address this, the paper proposes **Diagonal Decoding (DiagD)**, a training-free acceleration paradigm that generates tokens in parallel along diagonal paths across the spatiotemporal token grid, enabling intra-frame parallelism together with partial overlap across consecutive frames. This substantially improves inference efficiency for long videos. The paper further introduces a **cost-effective fine-tuning strategy** that aligns the model's attention patterns with the diagonal decoding order, mitigating the training-inference gap, which is particularly pronounced for small-scale models. The method requires no architectural modifications or re-pretraining and supports flexible trade-offs between inference speed and visual quality. Evaluated on multiple autoregressive video generation models and datasets, it achieves up to 10× inference speedup over naive sequential decoding while maintaining comparable visual fidelity.
📝 Abstract
Autoregressive Transformer models have demonstrated impressive performance in video generation, but their sequential token-by-token decoding process poses a major bottleneck, particularly for long videos represented by tens of thousands of tokens. In this paper, we propose Diagonal Decoding (DiagD), a training-free inference acceleration algorithm for autoregressively pre-trained models that exploits spatial and temporal correlations in videos. Our method generates tokens along diagonal paths in the spatial-temporal token grid, enabling parallel decoding within each frame as well as partially overlapping across consecutive frames. The proposed algorithm is versatile and adaptive to various generative models and tasks, while providing flexible control over the trade-off between inference speed and visual quality. Furthermore, we propose a cost-effective finetuning strategy that aligns the attention patterns of the model with our decoding order, further mitigating the training-inference gap on small-scale models. Experiments on multiple autoregressive video generation models and datasets demonstrate that DiagD achieves up to $10\times$ speedup compared to naive sequential decoding, while maintaining comparable visual fidelity.
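The diagonal decoding order described above can be sketched with a small scheduling function. This is a minimal illustration, not the paper's implementation: it assumes each frame is an `height × width` token grid, that all tokens on the same anti-diagonal (`row + col`) of a frame can be decoded in one parallel step, and that frame `f` starts `frame_offset` steps after frame `f-1`, giving the inter-frame overlap. The function name and parameters are hypothetical.

```python
def diagonal_schedule(num_frames, height, width, frame_offset=1):
    """Group token coordinates (frame, row, col) into parallel decoding steps.

    Tokens sharing a step index are generated simultaneously. A token at
    (f, r, c) is scheduled at step f * frame_offset + r + c, so tokens on
    the same anti-diagonal of a frame decode in parallel, and consecutive
    frames partially overlap instead of decoding strictly one after another.
    A larger frame_offset reduces overlap (closer to sequential decoding);
    a smaller one increases parallelism, trading quality for speed.
    """
    steps = {}
    for f in range(num_frames):
        for r in range(height):
            for c in range(width):
                step = f * frame_offset + r + c
                steps.setdefault(step, []).append((f, r, c))
    return [steps[s] for s in sorted(steps)]


# Naive sequential decoding of 2 frames of 4x4 tokens takes 2*4*4 = 32 steps;
# the diagonal schedule covers the same tokens in far fewer parallel steps.
schedule = diagonal_schedule(num_frames=2, height=4, width=4, frame_offset=2)
print(len(schedule))  # number of parallel steps: (2-1)*2 + (4-1) + (4-1) + 1 = 9
```

Under this toy schedule the step count drops from `F*H*W` to `(F-1)*frame_offset + (H-1) + (W-1) + 1`, which is the source of the speedup; the actual speed-quality trade-off in the paper is controlled by how aggressively diagonals are parallelized.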