Speculative Decoding for Autoregressive Video Generation

📅 2026-04-19

🤖 AI Summary

This work brings speculative decoding to block-based autoregressive video diffusion, proposing the SDVG framework to accelerate inference. Because video blocks are continuous spatiotemporal tensors with no token-level distribution for exact rejection sampling, SDVG replaces conventional token verification with an image-quality router: each drafted block is VAE-decoded and scored by ImageReward using worst-frame aggregation (the minimum per-frame reward), then compared against a fixed threshold tau. A 1.3B-parameter drafter generates candidate blocks; accepted blocks enter the 14B target's KV cache, and rejected blocks are regenerated by the target. Varying tau traces a smooth quality-speed Pareto frontier. The framework is training-free and plug-and-play, requiring no architectural changes. Evaluated on 1003 MovieGenVideoBench prompts, SDVG achieves up to 2.09× speedup while retaining 95.7% of target-only quality, and delivers 1.59× acceleration with 98.1% quality retention at tau = -0.7, outperforming draft-only generation by over 17%.
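To make the worst-frame aggregation concrete, here is a minimal sketch of how a minimum over per-frame rewards can reject a block that a mean would accept. The per-frame reward values are invented for illustration, and the function names are hypothetical; only the threshold value -0.7 and the "minimum per-frame reward" rule come from the abstract.

```python
from statistics import mean


def worst_frame_score(frame_rewards):
    """Aggregate per-frame rewards by taking the minimum (the worst frame)."""
    return min(frame_rewards)


def accept_block(frame_rewards, tau):
    """Accept a drafted block only if its worst frame scores above tau."""
    return worst_frame_score(frame_rewards) > tau


# Hypothetical rewards for a 4-frame block; the third frame has an artifact.
rewards = [0.4, 0.5, -1.2, 0.45]
TAU = -0.7  # threshold value quoted in the abstract

mean_score = mean(rewards)                # averaging hides the bad frame
worst_score = worst_frame_score(rewards)  # the minimum exposes it

print(f"mean={mean_score:.4f} accepted_by_mean={mean_score > TAU}")
print(f"worst={worst_score:.4f} accepted_by_min={accept_block(rewards, TAU)}")
```

With these made-up numbers the mean sits well above the threshold while the minimum falls below it, so only the worst-frame rule sends the block back to the target model.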

📝 Abstract

Autoregressive video diffusion is emerging as a promising paradigm for streaming video synthesis, with step distillation serving as the primary means of accelerating inference. Whether speculative decoding, the dominant acceleration strategy for large language models, can be effectively adapted to autoregressive video generation remains an open question, because video blocks are continuous spatiotemporal tensors with no token-level distribution for exact rejection sampling. We introduce SDVG, which brings speculative decoding to block-based autoregressive video diffusion by replacing token verification with an image-quality router. A 1.3B drafter proposes candidate blocks via four denoising steps; each block is VAE-decoded and scored by ImageReward using worst-frame aggregation, that is, taking the minimum per-frame reward to catch single-frame artifacts that averaging would mask. Blocks scoring above a fixed threshold tau are accepted into the 14B target's KV cache; the rest are regenerated by the target. Two additional design choices prove critical: the first block is always force-rejected to anchor scene composition, and tau serves as a single knob that traces a smooth quality-speed Pareto frontier. On 1003 MovieGenVideoBench prompts (832x480), SDVG retains 98.1% of target-only VisionReward quality (0.0773 vs. 0.0788) at a 1.59x speedup with tau=-0.7, and reaches 2.09x at 95.7% quality retention, while consistently outperforming draft-only generation by over 17%. The framework is training-free, requires no architectural changes, and can be seamlessly integrated into existing autoregressive video generation pipelines.
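The accept-or-regenerate routing described in the abstract can be sketched as a short loop. This is an illustrative outline under stated assumptions, not the authors' implementation: `route_blocks`, `draft_block`, `target_block`, and `score_frames` are hypothetical names, the models and the ImageReward scorer are passed in as plain callables, a Python list stands in for the target's KV cache, and skipping the drafter entirely for the first block is an assumption about how "force-rejected" is realized.

```python
from typing import Any, Callable, List, Sequence, Tuple


def route_blocks(
    num_blocks: int,
    draft_block: Callable[[int, List[Any]], Any],
    target_block: Callable[[int, List[Any]], Any],
    score_frames: Callable[[Any], Sequence[float]],
    tau: float,
) -> Tuple[List[Any], int]:
    """Build a video block by block, keeping drafts whose worst frame beats tau.

    Returns the list of kept blocks and the number of accepted drafts.
    """
    context: List[Any] = []  # stand-in for the target's KV cache (assumption)
    accepted = 0
    for i in range(num_blocks):
        if i == 0:
            # First block is always produced by the target to anchor the scene.
            # Assumption: we skip drafting here rather than draft and discard.
            block = target_block(i, context)
        else:
            candidate = draft_block(i, context)
            # Worst-frame aggregation: the minimum per-frame reward.
            if min(score_frames(candidate)) > tau:
                block = candidate
                accepted += 1
            else:
                block = target_block(i, context)  # regenerate with the target
        context.append(block)
    return context, accepted


# Toy stand-ins: blocks are (source, index) tuples and rewards are invented.
toy_rewards = {1: [0.2, 0.1], 2: [0.3, -1.5], 3: [-0.1, -0.6]}
video, n_accepted = route_blocks(
    num_blocks=4,
    draft_block=lambda i, ctx: ("draft", i),
    target_block=lambda i, ctx: ("target", i),
    score_frames=lambda block: toy_rewards[block[1]],
    tau=-0.7,
)
print(video, n_accepted)
```

In this toy run, block 0 comes from the target, blocks 1 and 3 are accepted drafts, and block 2 is regenerated because one of its frames scores below tau. Raising tau would reject more drafts (higher quality, less speedup), which is the single-knob trade-off the abstract describes.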

Problem

Research questions and friction points this paper is trying to address.

speculative decoding

autoregressive video generation

video diffusion

inference acceleration

spatiotemporal tensors

Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding

autoregressive video generation

image-quality router