Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion

📅 2026-05-25
🤖 AI Summary
This work identifies and formalizes the "Jensen bias" in long video generation: because the exponential in softmax attention is convex, quantization noise in a low-bit key-value (KV) cache systematically inflates the attention weights of cached keys, which then steal attention mass from the unquantized current chunk and degrade video quality. To mitigate this, the authors derive a per-attention-score correction that removes the bias in expectation, computed on the fly from the quantization step sizes of the cached keys and the query norm. With a second-order Taylor approximation, the added computation is negligible and no memory is needed beyond the cache itself. On MAGI-1, SkyReels-V2, and HY-WorldPlay, INT2 quantization with the correction recovers most of the lost quality, reaches near-BF16 video quality, and can outperform INT4 quantization while using 50% less KV-cache memory.
📝 Abstract
Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this cache quickly becomes a memory bottleneck as videos grow longer. Methods that quantize the KV cache to low bitwidths reduce memory pressure but degrade video quality. We show that a key driver of this degradation is a systematic bias in attention weights: due to the convexity of the exponential in softmax attention, quantization noise inflates the contribution of cached keys, a phenomenon we call the Jensen bias. This effect causes quantized keys to steal attention mass from the unquantized current chunk. We derive a per-attention-score correction that removes this bias in expectation, computed on the fly from the quantization step sizes of the cached keys and the query norm. Using a second-order Taylor approximation, the additional computational overhead is negligible, and no additional memory is needed alongside the cache. Evaluated on MAGI-1, SkyReels-V2, and HY-WorldPlay at INT2 quantization, our correction recovers most of the quality lost to aggressive quantization, reaching near-BF16 video quality, and can outperform INT4 quantization while using 50% less memory.
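The mechanism described in the abstract can be illustrated with a small NumPy simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: quantization error is modeled as independent uniform noise in [-step/2, step/2] per key coordinate (the textbook model, variance step²/12), the 1/sqrt(d) attention scaling is folded into a unit-norm query, and all sizes, step values, and names (`cached_mass`, `steps`, `correction`) are hypothetical. Under this model the logit noise has variance ‖q‖²·step²/12, so a second-order Taylor expansion of the exponential gives an expected inflation factor of about 1 + ‖q‖²·step²/24, and subtracting its logarithm from each cached logit should roughly cancel the bias.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_cached, n_current, trials = 32, 128, 32, 1000  # hypothetical sizes

q = rng.normal(size=d)
q /= np.linalg.norm(q)  # unit-norm query; 1/sqrt(d) scaling assumed folded in
K_cached = rng.normal(size=(n_cached, d))    # keys that would be quantized
K_current = rng.normal(size=(n_current, d))  # current chunk, kept unquantized
steps = rng.uniform(1.0, 1.6, size=n_cached)  # assumed per-key quantization step

def cached_mass(cached_logits, current_logits):
    """Fraction of softmax attention mass that lands on the cached keys."""
    logits = np.concatenate([cached_logits, current_logits], axis=-1)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return w[..., : cached_logits.shape[-1]].sum(axis=-1) / w.sum(axis=-1)

s_cached = K_cached @ q
s_current = K_current @ q
mass_ref = cached_mass(s_cached, s_current)  # noise-free reference

# Assumed noise model: uniform error in [-step/2, step/2] on every coordinate.
noise = rng.uniform(-0.5, 0.5, size=(trials, n_cached, d)) * steps[None, :, None]
s_noisy = s_cached + noise @ q  # shape (trials, n_cached)
s_cur = np.broadcast_to(s_current, (trials, n_current))
mass_noisy = cached_mass(s_noisy, s_cur).mean()

# Second-order Taylor: E[exp(n)] ~= 1 + Var(n)/2, with Var(n) = ||q||^2 step^2 / 12.
logit_var = (q @ q) * steps**2 / 12.0
correction = np.log1p(0.5 * logit_var)  # one scalar per cached key
mass_corrected = cached_mass(s_noisy - correction, s_cur).mean()

print(f"reference cached mass: {mass_ref:.4f}")
print(f"noisy cached mass:     {mass_noisy:.4f}")
print(f"corrected cached mass: {mass_corrected:.4f}")
```

In this toy setting the noisy cached mass should sit above the reference, and the corrected value should land much closer to it. The correction needs only one scalar per cached key, derived from quantities a quantized cache already stores (the step sizes) plus the query norm, which is consistent with the abstract's claim that no extra memory is required.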
Problem

Research questions and friction points this paper is trying to address.

KV-cache compression
video diffusion
quantization bias
attention mechanism
memory bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV-cache quantization
attention bias correction
video diffusion models
Jensen bias
low-bit inference