HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

📅 2026-05-14
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the high memory consumption of key-value (KV) caching in visual autoregressive (VAR) models for image generation, which often exceeds several gigabytes and limits practical deployment. The authors propose a fine-grained, attention-head-specific KV cache compression method that uses offline calibration to assess each head's reliance on historical tokens. By integrating attention-score-based head ranking, static pruning, and dynamic cache allocation, the approach enables differentiated compression under a fixed memory budget. Evaluated on the Infinity-2B model, the technique achieves a 2× higher KV cache compression ratio than existing methods while preserving or even improving image fidelity, prompt alignment, and human-perceived quality, thereby establishing a new state of the art in VAR model cache compression.
📝 Abstract
Visual Autoregressive (VAR) models have recently demonstrated impressive image generation quality while maintaining low latency. However, they suffer from severe KV-cache memory constraints, often requiring gigabytes of memory per generated image. We introduce HeatKV, a novel compression method that adapts cache allocation in each head based on its attention to previously generated scales. Using a small offline calibration set, the attention heads are ranked according to their attention scores over prior scales. Based on this ranking, we construct a static pruning schedule tailored to a given memory budget. Applied to the Infinity-2B model, HeatKV achieves a $2\times$ higher compression ratio in memory allocation for the KV cache compared to existing methods, while maintaining similar or better image fidelity, prompt alignment, and human perception scores. Our method sets a new state of the art (SOTA) for VAR model KV-cache compression, showcasing the effectiveness of fine-grained, head-specific cache allocation.
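The pipeline described in the abstract (score heads on a calibration set, rank them, derive a static per-head schedule for a memory budget) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `prior_scale_attention_mass` and `static_schedule`, the tensor layout, the assumption that prior-scale tokens occupy the first key positions, and the greedy fill rule are all hypothetical choices made for illustration only.

```python
import numpy as np

def prior_scale_attention_mass(attn, num_prior_tokens):
    """Per-head calibration score (hypothetical definition).

    attn: softmax attention weights, shape (samples, heads, queries, keys).
    Assumes keys from earlier scales occupy the first `num_prior_tokens`
    positions. Returns, for each head, the mean attention mass that its
    queries place on those prior-scale keys.
    """
    mass = attn[..., :num_prior_tokens].sum(axis=-1)  # (samples, heads, queries)
    return mass.mean(axis=(0, 2))                     # (heads,)

def static_schedule(scores, full_len, budget_ratio, min_keep=1):
    """Turn head scores into per-head cache lengths (illustrative rule).

    The total number of cached tokens is capped at
    budget_ratio * num_heads * full_len. Every head keeps at least
    `min_keep` tokens; the rest of the budget is handed out greedily
    to heads in descending score order, up to `full_len` each.
    """
    scores = np.asarray(scores, dtype=float)
    num_heads = scores.size
    total = int(budget_ratio * num_heads * full_len)
    if total < num_heads * min_keep:
        raise ValueError("budget too small for the per-head minimum")

    keep = np.full(num_heads, min_keep, dtype=int)
    remaining = total - int(keep.sum())
    for head in np.argsort(-scores):  # highest-scoring head first
        if remaining <= 0:
            break
        extra = min(full_len - min_keep, remaining)
        keep[head] += extra
        remaining -= extra
    return keep

# Example: four heads, 100 cacheable tokens each, 50% memory budget.
scores = [0.1, 0.7, 0.4, 0.2]
keep = static_schedule(scores, full_len=100, budget_ratio=0.5)
# keep.tolist() == [1, 100, 98, 1]
```

Because the schedule depends only on calibration statistics, it can be computed once offline and reused for every generated image; the paper's actual ranking criterion and allocation rule may differ from this greedy variant.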
Problem

Research questions and friction points this paper is trying to address.

KV-cache compression
Visual Autoregressive Modeling
memory constraints
image generation
attention heads
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV-cache compression
visual autoregressive modeling
head-specific pruning
attention calibration
memory-efficient generation