🤖 AI Summary
This work addresses the high latency and rapidly growing GPU memory consumption of autoregressive video diffusion models at inference time, both of which stem from the continuously expanding key-value (KV) cache and severely limit the usable temporal context and long-range consistency. The paper presents the first systematic identification and mitigation of three key sources of redundancy, introducing a unified, training-free acceleration framework. This framework integrates temporally aligned KV cache compression (TempCache), approximate nearest neighbor–based cross-attention prompt filtering (AnnCA), and semantic-matching–guided sparse self-attention (AnnSA). The proposed method achieves 5–10× end-to-end speedup with nearly imperceptible degradation in visual quality, while maintaining stable throughput and approximately constant peak GPU memory usage even during extended-duration video generation.
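The KV-cache compression idea can be illustrated with a minimal sketch: if cached keys from consecutive frames are near-duplicates, dropping redundant entries bounds cache growth. The function below keeps a cached entry only when its cosine similarity to every previously kept entry is below a threshold. This is an illustrative stand-in, not TempCache's actual temporal-correspondence algorithm, and the names `compress_kv_cache` and `sim_thresh` are invented for this example.

```python
import numpy as np

def compress_kv_cache(keys, values, sim_thresh=0.98):
    """Drop near-duplicate cached KV entries (illustrative only).

    An entry is kept only if its cosine similarity to every
    already-kept key stays below sim_thresh; otherwise it is
    treated as redundant and discarded.
    """
    kept_k, kept_v = [], []
    for k, v in zip(keys, values):
        kn = k / np.linalg.norm(k)
        if all(kn @ (kk / np.linalg.norm(kk)) < sim_thresh for kk in kept_k):
            kept_k.append(k)
            kept_v.append(v)
    return np.array(kept_k), np.array(kept_v)

# Toy cache: the second key is a near-duplicate of the first.
base = np.ones(4)
keys = np.stack([base, 1.001 * base, np.array([1.0, -1.0, 1.0, -1.0])])
vals = np.arange(12.0).reshape(3, 4)
ck, cv = compress_kv_cache(keys, vals)
print(ck.shape[0])  # 2 entries survive; cache size is bounded
```

In a real autoregressive rollout this pruning would run as new frames are cached, so peak memory stays roughly constant instead of growing with video length.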
📝 Abstract
Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory usage, which in turn restricts the usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also via a lightweight ANN. Together, these modules reduce attention computation and memory, and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate up to 5–10× end-to-end speedups while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.
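The sparse self-attention idea can be sketched minimally: each query attends only to its top-k best-matching keys rather than the full cache. The sketch below selects matches by exact dot-product top-k in NumPy for clarity, whereas the paper uses a fast approximate nearest neighbor search; `sparse_attention_topk` and `k=8` are illustrative choices, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention_topk(Q, K, V, k=8):
    """Each query attends only to its k highest-scoring keys.

    Scores outside the top-k are masked to -inf so they receive
    zero attention weight; an ANN index would replace the exact
    argpartition step in a real system.
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])               # (n_q, n_k)
    idx = np.argpartition(scores, -k, axis=-1)[:, -k:]    # top-k key ids
    masked = np.full_like(scores, -np.inf)
    np.put_along_axis(masked, idx,
                      np.take_along_axis(scores, idx, axis=-1), axis=-1)
    return softmax(masked, axis=-1) @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 16))    # 4 queries from the current frame
K = rng.normal(size=(64, 16))   # 64 cached keys
V = rng.normal(size=(64, 16))
out = sparse_attention_topk(Q, K, V, k=8)
print(out.shape)  # (4, 16)
```

With k fixed, per-query attention cost no longer scales with the full cache length, which is the mechanism behind the stable throughput the abstract reports.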