Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work identifies and formalizes a phenomenon in long video generation, termed the “Jensen bias”: because the exponential in softmax attention is convex, quantization noise in a low-bit key-value (KV) cache systematically inflates the attention weights of cached keys, which then take attention mass away from the unquantized current chunk and degrade video quality. To mitigate this, the authors derive a per-attention-score correction that removes the bias in expectation. It is computed on the fly from the quantization step sizes of the cached keys and the query norm, and a second-order Taylor approximation keeps the added compute negligible. The correction needs no additional memory alongside the cache. Experiments on MAGI-1, SkyReels-V2, and HY-WorldPlay show that INT2 quantization with the correction recovers most of the lost quality, reaches near-BF16 video quality, and can outperform INT4 quantization while using 50% less memory.

📝 Abstract

Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this cache quickly becomes a memory bottleneck as videos grow longer. Methods that quantize the KV cache to low bitwidths reduce memory pressure but degrade video quality. We show that a key driver of this degradation is a systematic bias in attention weights: due to the convexity of the exponential in softmax attention, quantization noise inflates the contribution of cached keys, a phenomenon we call the Jensen bias. This effect causes quantized keys to steal attention mass from the unquantized current chunk. We derive a per-attention-score correction that removes this bias in expectation, computed on the fly from the quantization step sizes of the cached keys and the query norm. Using a second-order Taylor approximation, the additional computational overhead is negligible, and no additional memory is needed alongside the cache. Evaluated on MAGI-1, SkyReels-V2, and HY-WorldPlay at INT2 quantization, our correction recovers most of the quality lost to aggressive quantization, reaching near-BF16 video quality, and can outperform INT4 quantization while using 50% less memory.
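The bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation. It assumes a simple noise model that the abstract does not spell out: each coordinate of a cached key carries independent quantization error, uniform on [-step/2, step/2], with one step size per key. Under that assumption the noisy logit has variance ‖q‖²·step²/(12·d), and a second-order expansion gives a log-domain inflation of half that variance. The function names `jensen_log_bias`, `mean_inflation`, and `corrected_logits` are hypothetical.

```python
import math
import random

def jensen_log_bias(q, step):
    """Second-order estimate of log E[exp(q . eps / sqrt(d))].

    Assumes eps_i ~ Uniform(-step/2, step/2), independent per coordinate
    (an illustrative noise model, not taken from the paper).
    """
    d = len(q)
    q_norm_sq = sum(x * x for x in q)
    # Var(q . eps / sqrt(d)) = ||q||^2 * step^2 / (12 d); the bias is half of it.
    return q_norm_sq * step * step / (24.0 * d)

def mean_inflation(q, step, trials=20000, seed=0):
    """Monte Carlo estimate of E[exp(noise logit)]; 1.0 would mean no bias."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(len(q))
    half = step / 2.0
    total = 0.0
    for _ in range(trials):
        noise_logit = scale * sum(x * rng.uniform(-half, half) for x in q)
        total += math.exp(noise_logit)
    return total / trials

def corrected_logits(cached_logits, q, steps):
    """Subtract the estimated bias from each cached-key logit before softmax.

    Logits of the unquantized current chunk would be left untouched.
    """
    return [s - jensen_log_bias(q, step) for s, step in zip(cached_logits, steps)]

# Hypothetical query with ||q||^2 = 288 in d = 32 dimensions.
q = [3.0 if i % 2 == 0 else -3.0 for i in range(32)]
step = 0.5

inflation = mean_inflation(q, step)
residual = inflation * math.exp(-jensen_log_bias(q, step))
print(f"uncorrected: {inflation:.3f}, corrected: {residual:.3f}")
```

The uncorrected average exceeds 1 even though the noise has zero mean, which is Jensen's inequality applied to the exponential. Subtracting the estimated bias from the logit brings the average back close to 1. The correction depends only on the query norm and the per-key step size, both of which are already available at attention time, which is consistent with the abstract's claim that no extra memory is stored alongside the cache.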

Problem

Research questions and friction points this paper is trying to address.

KV-cache compression

video diffusion

quantization bias

attention mechanism

memory bottleneck

Innovation

Methods, ideas, or system contributions that make the work stand out.

KV-cache quantization

attention bias correction

video diffusion models