🤖 AI Summary
This work addresses the high latency and rapidly growing GPU memory consumption of autoregressive video diffusion models at inference time, both of which stem from the continuously expanding key-value (KV) cache and severely limit the usable temporal context and long-range consistency. The paper presents the first systematic identification and mitigation of three key sources of redundancy, introducing a unified, training-free acceleration framework. This framework integrates temporally aligned KV cache compression (TempCache), approximate nearest neighbor–based cross-attention prompt filtering (AnnCA), and semantic-matching–guided sparse self-attention (AnnSA). The proposed method achieves 5–10× end-to-end speedup with nearly imperceptible degradation in visual quality, while maintaining stable throughput and approximately constant peak GPU memory usage even during extended-duration video generation.
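The KV-cache compression idea can be illustrated with a minimal sketch: if cached keys from consecutive frames are near-duplicates, dropping redundant entries bounds cache growth. The function below keeps a cached entry only when its cosine similarity to every previously kept entry is below a threshold. This is an illustrative stand-in, not TempCache's actual temporal-correspondence algorithm, and the names `compress_kv_cache` and `sim_thresh` are invented for this example.

```python
import numpy as np

def compress_kv_cache(keys, values, sim_thresh=0.98):
    """Drop near-duplicate cached KV entries (illustrative only).

    An entry is kept only if its cosine similarity to every
    already-kept key stays below sim_thresh; otherwise it is
    treated as redundant and discarded.
    """
    kept_k, kept_v = [], []
    for k, v in zip(keys, values):
        kn = k / np.linalg.norm(k)
        if all(kn @ (kk / np.linalg.norm(kk)) < sim_thresh for kk in kept_k):
            kept_k.append(k)
            kept_v.append(v)
    return np.array(kept_k), np.array(kept_v)

# Toy cache: the second key is a near-duplicate of the first.
base = np.ones(4)
keys = np.stack([base, 1.001 * base, np.array([1.0, -1.0, 1.0, -1.0])])
vals = np.arange(12.0).reshape(3, 4)
ck, cv = compress_kv_cache(keys, vals)
print(ck.shape[0])  # 2 entries survive; cache size is bounded
```

In a real autoregressive rollout this pruning would run as new frames are cached, so peak memory stays roughly constant instead of growing with video length.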
📝 Abstract
Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory usage, which in turn restricts the usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also via a lightweight ANN. Together, these modules reduce attention computation and memory, and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate up to 5–10× end-to-end speedups while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.
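The sparse self-attention idea can be sketched minimally: each query attends only to its top-k best-matching keys rather than the full cache. The sketch below selects matches by exact dot-product top-k in NumPy for clarity, whereas the paper uses a fast approximate nearest neighbor search; `sparse_attention_topk` and `k=8` are illustrative choices, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention_topk(Q, K, V, k=8):
    """Each query attends only to its k highest-scoring keys.

    Scores outside the top-k are masked to -inf so they receive
    zero attention weight; an ANN index would replace the exact
    argpartition step in a real system.
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])               # (n_q, n_k)
    idx = np.argpartition(scores, -k, axis=-1)[:, -k:]    # top-k key ids
    masked = np.full_like(scores, -np.inf)
    np.put_along_axis(masked, idx,
                      np.take_along_axis(scores, idx, axis=-1), axis=-1)
    return softmax(masked, axis=-1) @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 16))    # 4 queries from the current frame
K = rng.normal(size=(64, 16))   # 64 cached keys
V = rng.normal(size=(64, 16))
out = sparse_attention_topk(Q, K, V, k=8)
print(out.shape)  # (4, 16)
```

With k fixed, per-query attention cost no longer scales with the full cache length, which is the mechanism behind the stable throughput the abstract reports.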