AI Summary
This work addresses the high memory consumption of key-value (KV) caching in visual autoregressive (VAR) models for image generation, which often exceeds several gigabytes and limits practical deployment. The authors propose a fine-grained, attention-head-specific KV cache compression method that leverages offline calibration to assess each head's reliance on historical tokens. By integrating attention-score-based head ranking, static pruning, and dynamic cache allocation, the approach enables differentiated compression under a fixed memory budget. Evaluated on the Infinity-2B model, this technique achieves a 2× higher KV cache compression ratio compared to existing methods while preserving or even improving image fidelity, prompt alignment, and human-perceived quality, thereby establishing a new state of the art in VAR model cache compression.
Abstract
Visual Autoregressive (VAR) models have recently demonstrated impressive image generation quality while maintaining low latency. However, they suffer from severe KV-cache memory constraints, often requiring gigabytes of memory per generated image. We introduce HeatKV, a novel compression method that adapts the cache allocation of each head based on its attention to previously generated scales. Using a small offline calibration set, the attention heads are ranked according to their attention scores over prior scales. Based on this ranking, we construct a static pruning schedule tailored to a given memory budget. Applied to the Infinity-2B model, HeatKV achieves a $2 \times$ higher compression ratio in KV-cache memory allocation than existing methods, while maintaining similar or better image fidelity, prompt alignment, and human perception score. Our method sets a new state of the art (SOTA) for VAR model KV-cache compression, showcasing the effectiveness of fine-grained, head-specific cache allocation.
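The calibrate, rank, and allocate pipeline described above can be illustrated with a minimal sketch. The abstract does not specify the scoring formula or the allocation rule, so everything here is an assumption made for illustration: the function names `rank_heads` and `static_schedule`, the use of a mean over calibration samples as the head score, and the greedy rank-order allocation with per-head minimum and maximum budgets. It should not be read as the HeatKV algorithm itself.

```python
from typing import Dict, List


def rank_heads(prior_scale_attention: List[List[float]]) -> List[int]:
    """Rank heads by how much attention they place on earlier scales.

    prior_scale_attention[s][h] is the (hypothetical) attention mass that
    head h assigns to previously generated scales on calibration sample s.
    Returns head indices ordered from most to least reliant on history.
    """
    num_samples = len(prior_scale_attention)
    num_heads = len(prior_scale_attention[0])
    # Assumed score: mean attention mass over the calibration set.
    mean_score = [
        sum(sample[h] for sample in prior_scale_attention) / num_samples
        for h in range(num_heads)
    ]
    return sorted(range(num_heads), key=lambda h: mean_score[h], reverse=True)


def static_schedule(
    ranking: List[int], total_budget: int, min_tokens: int, max_tokens: int
) -> Dict[int, int]:
    """Build a fixed per-head cache budget (in tokens) from a head ranking.

    Assumed rule: every head keeps min_tokens; the remaining budget is handed
    out in rank order, capped at max_tokens per head.
    """
    remaining = total_budget - min_tokens * len(ranking)
    if remaining < 0:
        raise ValueError("total_budget is too small for the per-head minimum")
    budgets = {h: min_tokens for h in ranking}
    for h in ranking:
        extra = min(max_tokens - min_tokens, remaining)
        budgets[h] += extra
        remaining -= extra
    return budgets


# Toy calibration data: 2 samples, 3 heads (values are made up).
calibration = [[0.9, 0.1, 0.5], [0.7, 0.3, 0.5]]
ranking = rank_heads(calibration)
schedule = static_schedule(ranking, total_budget=100, min_tokens=10, max_tokens=60)
print(ranking)   # [0, 2, 1]
print(schedule)  # {0: 60, 2: 30, 1: 10}
```

Because the schedule is computed once from calibration data, it adds no ranking cost at generation time; the trade-off of this design is that the per-head budgets cannot adapt to an individual prompt.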