VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

📅 2026-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work targets the large KV-cache memory footprint and inference latency of long-rollout causal video diffusion models by applying Multi-Head Latent Attention (MLA) to them. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared, decoupled 3D rotary position embedding (3D-RoPE) key, which cuts per-token KV memory by 92.7% at every cached layer. It is the first study of MLA in video diffusion, and it shows that the effective rank is set by the MLA bottleneck rather than by any low-rank structure in the pretrained attention, which is in fact not low-rank. On VBench, the method matches streaming baselines at short horizons, achieves the best overall score at long horizons among the evaluated methods, and improves throughput by 1.23× on a single B200 GPU.
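The memory saving follows from simple per-token accounting: a conventional cache stores one key and one value vector per head, while an MLA-style cache stores one shared latent plus one shared positional key. The sketch below illustrates that arithmetic only. The function name and all dimensions (`n_heads`, `head_dim`, `d_latent`, `d_rope`) are hypothetical and are not taken from the paper; the example values were picked only so that the result lands near the reported 92.7%.

```python
def kv_reduction(n_heads: int, head_dim: int, d_latent: int, d_rope: int) -> float:
    """Fraction of per-token, per-layer KV cache entries saved by a shared latent.

    Assumes the baseline caches a key and a value per head, and the
    MLA-style cache holds one content latent plus one positional key
    shared across all heads. Both layouts are simplifying assumptions.
    """
    baseline = 2 * n_heads * head_dim   # keys + values for every head
    compressed = d_latent + d_rope      # shared latent + shared RoPE key
    return 1.0 - compressed / baseline


# Hypothetical dimensions, for illustration only.
saving = kv_reduction(n_heads=12, head_dim=128, d_latent=192, d_rope=32)
print(f"per-token KV entries saved: {saving:.1%}")
```

Because the ratio depends only on these four sizes, the same saving applies at every cached layer that uses this layout, whatever the window length.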
📝 Abstract
Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.
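The "99%-energy effective rank" mentioned above can be read as the smallest number of singular values needed to capture 99% of a matrix's spectral energy. The sketch below assumes energy is measured as the sum of squared singular values, which is a common convention but is not confirmed by this summary; the function name is hypothetical, and the singular values are expected to come from an SVD computed elsewhere.

```python
def effective_rank(singular_values, energy=0.99):
    """Smallest k whose top-k squared singular values reach `energy` of the total.

    Assumption: energy is the sum of squared singular values.
    """
    energies = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energies)
    if total == 0:
        return 0
    cumulative = 0.0
    for k, e in enumerate(energies, start=1):
        cumulative += e
        if cumulative >= energy * total:
            return k
    return len(energies)


# A sharply decaying spectrum is captured by very few directions ...
print(effective_rank([10.0, 0.1]))
# ... while a flat spectrum needs every direction it has.
print(effective_rank([1.0] * 8))
```

A flat spectrum is the situation the abstract describes for pretrained video attention: the effective rank sits far above any practical latent dimension, so truncating the spectrum directly would discard a large share of the energy.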
Problem

Research questions and friction points this paper is trying to address.

KV cache
video diffusion
memory efficiency
autoregressive generation
attention mechanism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Head Latent Attention
Low-Rank KV Cache
Video Diffusion
3D-RoPE
Autoregressive Video Generation