Playing with Transformer at 30+ FPS via Next-Frame Diffusion

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the dual bottlenecks of low frame rates (<30 FPS) and high computational cost in autoregressive video generation, this work proposes the first real-time autoregressive video generation framework tailored for interactive applications. Methodologically, it introduces a novel integration of video-domain consistency distillation and action-driven speculative sampling, alongside a block-wise causal attention diffusion Transformer and parallel intra-frame token generation, ensuring temporal coherence while drastically reducing sampling steps. On an A100 GPU, the 310M-parameter model achieves >30 FPS streaming generation, supporting action-conditioned inputs and arbitrarily long outputs. Experiments demonstrate state-of-the-art performance in both visual quality and sampling efficiency compared to existing autoregressive video generation methods.

📝 Abstract
Autoregressive video models offer distinct advantages over bidirectional diffusion models in creating interactive video content and supporting streaming applications of arbitrary duration. In this work, we present Next-Frame Diffusion (NFD), an autoregressive diffusion transformer that incorporates block-wise causal attention, enabling iterative sampling and efficient inference via parallel token generation within each frame. Nonetheless, achieving real-time video generation remains a significant challenge for such models, primarily due to the high computational cost of diffusion sampling and the hardware inefficiencies inherent to autoregressive generation. To address this, we introduce two innovations: (1) we extend consistency distillation to the video domain and adapt it specifically for video models, enabling efficient inference with few sampling steps; (2) to fully leverage parallel computation, motivated by the observation that adjacent frames often share the same action input, we propose speculative sampling. In this approach, the model generates the next few frames using the current action input and discards the speculatively generated frames if a later action input differs. Experiments on a large-scale action-conditioned video generation benchmark demonstrate that NFD beats autoregressive baselines in both visual quality and sampling efficiency. For the first time, we achieve autoregressive video generation at over 30 Frames Per Second (FPS) on an A100 GPU using a 310M model.
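The block-wise causal attention described in the abstract can be pictured as a mask: tokens attend bidirectionally to all tokens within their own frame (enabling parallel intra-frame generation) and causally to all tokens of earlier frames. A minimal sketch, assuming hypothetical frame and token counts (the paper does not specify the mask construction in this summary):

```python
import numpy as np

def blockwise_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Build a block-wise causal attention mask.

    True at (i, j) means query token i may attend to key token j.
    Tokens attend to every token in their own frame (bidirectional
    intra-frame) and to every token of earlier frames (causal
    inter-frame). Illustrative sketch only; sizes are hypothetical.
    """
    n = num_frames * tokens_per_frame
    # Frame index of each flattened token position
    frame_idx = np.arange(n) // tokens_per_frame
    # Attend iff the key's frame is not later than the query's frame
    return frame_idx[:, None] >= frame_idx[None, :]

m = blockwise_causal_mask(3, 2)  # 3 frames, 2 tokens per frame
```

Within a frame the mask is fully dense, which is what lets all of a frame's tokens be denoised in parallel while still preventing any token from seeing future frames.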
Problem

Research questions and friction points this paper is trying to address.

Achieving real-time video generation with autoregressive diffusion models
Reducing computational cost and hardware inefficiencies in video generation
Improving visual quality and sampling efficiency in action-conditioned videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Next-Frame Diffusion with block-wise causal attention
Consistency distillation adapted for video models
Speculative sampling for parallel frame generation
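The speculative-sampling idea above can be sketched as a simple control loop: generate the next k frames assuming the current action persists, then keep speculated frames only while the true action stream matches, discarding the rest. This is a hedged sketch; `generate_frames`, the action representation, and `k` are all hypothetical stand-ins for the paper's actual sampler and interface:

```python
from typing import Callable, List

def speculative_generate(
    generate_frames: Callable[[object, int], List[object]],
    actions: List[object],
    k: int = 2,
) -> List[object]:
    """Action-driven speculative sampling sketch.

    Speculatively generate the next k frames under the current action;
    accept them while the observed action stream still matches, and
    discard the remainder as soon as the action changes.
    """
    out: List[object] = []
    t = 0
    while t < len(actions):
        a = actions[t]
        spec = generate_frames(a, k)  # frames t .. t+k-1, all under action a
        kept = 0
        for i in range(min(k, len(actions) - t)):
            if actions[t + i] == a:
                out.append(spec[i])
                kept += 1
            else:
                break  # action changed: drop remaining speculated frames
        t += kept  # frame t itself always matches, so kept >= 1
    return out
```

When actions are mostly constant between frames (the common case in interactive settings), nearly all speculated frames are accepted, so the GPU stays saturated with batched work instead of generating one frame at a time.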