🤖 AI Summary
Visual autoregressive (VAR) models suffer from exponential growth of key-value (KV) cache memory consumption and computational redundancy during multi-scale inference, due to their coarse-to-fine generation paradigm. To address this, we propose ScaleKV, a novel KV compression framework that introduces scale-aware layer functional partitioning—distinguishing between *drafters* (coarse-scale layers) and *refiners* (fine-scale layers)—enabling differentiated KV cache management across scales. ScaleKV integrates architectural analysis of Transformers, multi-scale attention modeling, dynamic KV pruning, and hierarchical cache capacity allocation. Evaluated on the Infinity text-to-image VAR model, ScaleKV reduces KV cache memory usage to 10% of the baseline while preserving pixel-level reconstruction fidelity. This work constitutes the first systematic solution to KV cache explosion in multi-scale VAR inference, establishing a new paradigm for efficient visual generation.
📝 Abstract
Visual Autoregressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction approach, which yields substantial improvements in efficiency, scalability, and zero-shot generalization. Nevertheless, the coarse-to-fine methodology inherent in VAR results in exponential growth of the KV cache during inference, causing considerable memory consumption and computational redundancy. To address these bottlenecks, we introduce ScaleKV, a novel KV cache compression framework tailored for VAR architectures. ScaleKV leverages two critical observations: varying cache demands across transformer layers and distinct attention patterns at different scales. Based on these insights, ScaleKV categorizes transformer layers into two functional groups: drafters and refiners. Drafters exhibit dispersed attention across multiple scales, thereby requiring greater cache capacity. Conversely, refiners focus attention on the current token map to process local details, consequently necessitating substantially reduced cache capacity. ScaleKV optimizes the multi-scale inference pipeline by identifying scale-specific drafters and refiners, facilitating differentiated cache management tailored to each scale. Evaluation on the state-of-the-art text-to-image VAR model family, Infinity, demonstrates that our approach effectively reduces the required KV cache memory to 10% of its original size while preserving pixel-level fidelity.
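The drafter/refiner split described above can be illustrated with a small sketch. This is not the ScaleKV implementation; the function names, the entropy-based layer ranking, and the 80/20 budget split are hypothetical assumptions chosen only to make the idea concrete: layers whose attention is dispersed across many prior scales (high attention entropy) are tagged as drafters and given large KV budgets, while the remaining refiners keep only a small, heavily pruned cache.

```python
import numpy as np

def allocate_cache_budgets(attention_entropy, total_budget, drafter_ratio=0.5):
    """Split a fixed KV cache budget across layers.

    Layers with the most dispersed attention (highest entropy) are treated
    as 'drafters' and receive large budgets; the rest are 'refiners' with
    small budgets. The 80/20 split below is a hypothetical choice for
    illustration, not the paper's allocation rule.
    """
    n_layers = len(attention_entropy)
    n_drafters = max(1, int(n_layers * drafter_ratio))
    # Rank layers by how spread-out their attention is across prior scales.
    order = np.argsort(attention_entropy)[::-1]
    drafters = set(order[:n_drafters].tolist())
    refiners = [i for i in range(n_layers) if i not in drafters]
    budgets = np.empty(n_layers, dtype=int)
    budgets[list(drafters)] = int(0.8 * total_budget) // n_drafters
    budgets[refiners] = int(0.2 * total_budget) // max(1, len(refiners))
    return drafters, budgets

def prune_kv_cache(keys, values, scores, budget):
    """Keep only the top-`budget` cached entries by accumulated attention
    score, preserving their original token order (a common KV-pruning
    heuristic; the exact selection criterion in ScaleKV may differ)."""
    if len(scores) <= budget:
        return keys, values
    keep = np.sort(np.argsort(scores)[::-1][:budget])
    return keys[keep], values[keep]
```

With per-layer budgets in hand, the cache for each refiner can be pruned after every scale, which is where the bulk of the memory reduction comes from, since fine scales contribute most of the cached tokens.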