🤖 AI Summary
In multi-scale autoregressive vision Transformers, the key-value (KV) cache accumulates the tokens of every previously generated scale, so it grows rapidly as the number of scales increases, severely limiting model scalability and generation efficiency.
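To make the growth concrete, the back-of-the-envelope sketch below tallies the cumulative KV footprint across scales. The 10-scale schedule, model width, layer count, and fp16 precision are illustrative assumptions in the style of VAR-class models, not figures taken from the paper.

```python
# Cumulative KV footprint under full caching in next-scale prediction.
# All constants below are illustrative assumptions, not the paper's numbers.
scales = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]    # side length of each scale's token map
width, layers, bytes_per_elem = 1920, 30, 2   # hypothetical model width / depth, fp16

tokens_cached = 0
for s in scales:
    tokens_cached += s * s                    # every token of every scale stays cached
    kv_bytes = 2 * tokens_cached * width * layers * bytes_per_elem  # keys + values
    print(f"scale {s:>2}x{s:<2}  cached tokens={tokens_cached:>4}  "
          f"KV ~= {kv_bytes / 2**20:6.1f} MiB")
```

Under these assumptions the full cache reaches roughly 150 MiB per sample at the finest scale, so a batch of 256 already approaches 40 GB of KV storage alone.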
Method: This paper presents the first systematic study of KV caching for next-scale prediction in visual autoregressive (VAR) models and proposes AMS-KV, a scale-adaptive hierarchical caching policy. Leveraging cross-scale key-value similarity analysis, it distinguishes local-detail scales from globally condensed (coarsest) scales, prioritizes caching the tokens most relevant to generation quality, and identifies cache-demanding layers that warrant a larger KV budget. The policy jointly coordinates multi-scale attention, inter-scale similarity measurement, and per-layer cache allocation within the vision Transformer.
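As a rough illustration of what a scale-adaptive retention rule can look like, the sketch below keeps KVs only from a few condensed (coarsest) scales plus a local window of the most recent scales, and widens that window for cache-demanding layers. The function name, window sizes, and per-layer flag are hypothetical; AMS-KV's exact rules may differ.

```python
def scales_to_keep(current_scale: int,
                   num_condensed: int = 2,
                   local_window: int = 2,
                   layer_is_cache_demanding: bool = False) -> list[int]:
    """Indices of scales whose KV entries are retained while generating
    `current_scale`; cache-demanding layers keep a wider local window."""
    window = local_window + (1 if layer_is_cache_demanding else 0)
    condensed = list(range(min(num_condensed, current_scale)))           # coarsest, global context
    local = list(range(max(0, current_scale - window), current_scale))   # nearby fine detail
    return sorted(set(condensed + local))
```

For example, `scales_to_keep(9)` returns `[0, 1, 7, 8]`: two condensed scales plus the two most recent local scales, instead of all nine preceding scales.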
Results: Experiments show an 84.83% reduction in KV cache usage, a 60.48% decrease in self-attention latency, and stable scaling to a batch size of 256 where the baseline runs out of GPU memory at 128, substantially improving the trade-off between generation throughput and output quality.
📝 Abstract
Visual autoregressive modeling (VAR) via next-scale prediction has emerged as a scalable image generation paradigm. While Key and Value (KV) caching in large language models (LLMs) has been extensively studied, next-scale prediction presents unique challenges, and KV caching design for next-scale-based VAR transformers remains largely unexplored. A major bottleneck is the excessive KV memory growth with the increasing number of scales, which severely limits scalability. Our systematic investigation reveals that: (1) attending to tokens from local scales contributes significantly to generation quality; (2) allocating a small amount of memory for the coarsest scales, termed condensed scales, stabilizes multi-scale image generation; and (3) strong KV similarity across finer scales is observed predominantly in cache-efficient layers, whereas cache-demanding layers exhibit weaker inter-scale similarity. Based on these observations, we introduce AMS-KV, a scale-adaptive KV caching policy for next-scale prediction in VAR models. AMS-KV prioritizes storing KVs from condensed and local scales, preserving the most relevant tokens to maintain generation quality. It further optimizes KV cache utilization and computational efficiency by identifying cache-demanding layers through inter-scale similarity analysis. Compared to vanilla next-scale prediction-based VAR models, AMS-KV reduces KV cache usage by up to 84.83% and self-attention latency by 60.48%. Moreover, while the baseline VAR-d30 model encounters out-of-memory failures at a batch size of 128, AMS-KV enables stable scaling to a batch size of 256 with improved throughput.
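Below is a minimal sketch of the kind of inter-scale similarity probe the abstract describes for telling cache-efficient layers apart from cache-demanding ones. Mean-pooling the keys, cosine similarity, and the 0.8 threshold are assumptions for illustration; the paper's actual measurement and selection criterion may differ.

```python
import torch

def cache_demanding_layers(keys: list[list[torch.Tensor]],
                           threshold: float = 0.8) -> list[int]:
    """keys[l][s] holds layer l's key tensor at scale s, shape (tokens_s, dim).
    Layers whose keys change a lot between consecutive scales (low cosine
    similarity of the mean-pooled keys) are flagged as cache-demanding."""
    demanding = []
    for layer_idx, per_scale in enumerate(keys):
        sims = [
            torch.cosine_similarity(prev.mean(dim=0), curr.mean(dim=0), dim=0).item()
            for prev, curr in zip(per_scale[:-1], per_scale[1:])
        ]
        if sims and sum(sims) / len(sims) < threshold:
            demanding.append(layer_idx)
    return demanding
```

Layers not returned here would be the cache-efficient ones, where strong cross-scale similarity makes aggressive KV dropping or sharing safe.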