HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high memory cost of key-value (KV) caching in visual autoregressive (VAR) image-generation models, which can require gigabytes of memory per generated image and limits practical deployment. The authors propose a fine-grained, attention-head-specific KV-cache compression method that uses a small offline calibration set to measure how strongly each head attends to previously generated scales. Heads are ranked by these attention scores, and the ranking determines a static pruning schedule that allocates cache to each head differently under a fixed memory budget. On the Infinity-2B model, the method achieves a 2× higher KV-cache compression ratio than existing methods while maintaining similar or better image fidelity, prompt alignment, and human perception scores, setting a new state of the art for VAR KV-cache compression.
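To make the scale of the memory problem concrete, here is a back-of-the-envelope estimate of an uncompressed KV cache for a next-scale autoregressive model. It is only a sketch: the layer count, head count, head dimension, and scale schedule below are hypothetical placeholders, not the Infinity-2B configuration. The only facts relied on are that keys and values are both cached (hence the factor of 2) and that tokens from all previous scales stay in the cache.

```python
def cached_tokens(scale_sides):
    """Total tokens kept in the cache when every scale's token map is retained.

    scale_sides: side length of the square token map at each scale
    (a hypothetical schedule, chosen for illustration).
    """
    return sum(side * side for side in scale_sides)


def kv_cache_bytes(num_layers, num_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Uncompressed KV-cache size: keys + values for every layer, head, and token."""
    return 2 * num_layers * num_heads * head_dim * num_tokens * bytes_per_elem


# Hypothetical model shape and scale schedule (NOT taken from the paper).
tokens = cached_tokens([1, 2, 4, 8, 16, 32, 64])
size = kv_cache_bytes(num_layers=32, num_heads=16, head_dim=128, num_tokens=tokens)
print(f"{tokens} cached tokens -> {size / 2**30:.2f} GiB at 16-bit precision")
```

Even with these modest placeholder numbers the cache exceeds a gibibyte for a single image, which is consistent with the order of magnitude the summary describes.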

📝 Abstract

Visual Autoregressive (VAR) models have recently demonstrated impressive image generation quality while maintaining low latency. However, they suffer from severe KV-cache memory constraints, often requiring gigabytes of memory per generated image. We introduce HeatKV, a novel compression method that adapts the cache allocation of each head based on its attention to previously generated scales. Using a small offline calibration set, the attention heads are ranked according to their attention scores over prior scales. Based on this ranking, we construct a static pruning schedule tailored to a given memory budget. Applied to the Infinity-2B model, HeatKV achieves a $2\times$ higher compression ratio in KV-cache memory allocation than existing methods, while maintaining similar or better image fidelity, prompt alignment, and human perception scores. Our method sets a new state of the art (SOTA) for VAR model KV-cache compression, showcasing the effectiveness of fine-grained, head-specific cache allocation.
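The pipeline described in the abstract (calibrate, rank heads, derive a static per-head schedule under a budget) can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not specify how the ranking is turned into a schedule, so the proportional split below is an assumption made purely for illustration. The input format `calib_scores`, the function names, and the `min_tokens` floor are likewise hypothetical.

```python
from statistics import mean


def rank_heads(calib_scores):
    """Average each head's attention to prior scales over the calibration set.

    calib_scores[s][h]: attention mass head h places on previously generated
    scales for calibration sample s (hypothetical input format).
    Returns (per-head averages, head indices sorted from most to least reliant).
    """
    num_heads = len(calib_scores[0])
    avg = [mean(sample[h] for sample in calib_scores) for h in range(num_heads)]
    order = sorted(range(num_heads), key=lambda h: avg[h], reverse=True)
    return avg, order


def static_schedule(avg, total_budget, min_tokens=1):
    """Split a fixed token budget across heads in proportion to their scores.

    Assumption: proportional allocation with a per-head floor. The paper's
    actual schedule construction may differ.
    """
    num_heads = len(avg)
    spare = total_budget - min_tokens * num_heads
    if spare < 0:
        raise ValueError("budget too small for the per-head minimum")
    total = sum(avg)
    if total == 0:
        shares = [spare / num_heads] * num_heads  # no signal: split evenly
    else:
        shares = [spare * a / total for a in avg]
    alloc = [min_tokens + int(s) for s in shares]
    # Hand out tokens lost to rounding, largest fractional remainder first.
    leftover = total_budget - sum(alloc)
    by_frac = sorted(range(num_heads), key=lambda h: shares[h] - int(shares[h]), reverse=True)
    for h in by_frac[:leftover]:
        alloc[h] += 1
    return alloc


# Toy calibration data: 2 samples, 3 heads (made-up numbers).
avg, order = rank_heads([[0.8, 0.1, 0.4], [0.6, 0.3, 0.2]])
schedule = static_schedule(avg, total_budget=12)
print(order, schedule)
```

Because the schedule is computed once offline, inference only needs to look up each head's allowance, which is what makes a static schedule cheap compared with deciding what to evict at generation time.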

Problem

Research questions and friction points this paper is trying to address.

KV-cache compression

Visual Autoregressive Modeling

memory constraints

image generation

attention heads

Innovation

Methods, ideas, or system contributions that make the work stand out.

KV-cache compression

visual autoregressive modeling

head-specific pruning