🤖 AI Summary
Autoregressive image generation suffers from memory explosion and throughput bottlenecks because the KV cache grows linearly with decoding length. This paper proposes LineAR, a training-free, progressive KV cache compression method built on a 2D row-wise view of the token sequence. Exploiting the spatial locality of visual attention and inter-row dependencies, LineAR dynamically identifies and prunes low-information tokens, retaining only the critical row-level caches. Its core contribution is a cross-row, attention-guided, training-free compression strategy that is compatible with diverse autoregressive image generation architectures. Evaluated on multiple state-of-the-art models, LineAR reduces GPU memory consumption by up to 67.61% and accelerates inference by up to 7.57×, while improving generation quality on ImageNet and COCO benchmarks. Remarkably, it matches or exceeds the original generation quality using only 1/6–1/8 of the original cache footprint.
📄 Abstract
Autoregressive (AR) visual generation has emerged as a powerful paradigm for image and multimodal synthesis, owing to its scalability and generality. However, existing AR image generation suffers from severe memory bottlenecks: all previously generated visual tokens must be cached during decoding, leading to both high storage requirements and low throughput. In this paper, we introduce LineAR, a novel, training-free, progressive key-value (KV) cache compression pipeline for autoregressive image generation. By fully exploiting the intrinsic characteristics of visual attention, LineAR manages the cache at the line level in a 2D view, preserving visually dependent regions while progressively evicting less-informative tokens that do not affect subsequent line generation, guided by inter-line attention. LineAR enables efficient AR image generation with only a few lines of cache, achieving both memory savings and throughput speedup while maintaining or even improving generation quality. Extensive experiments across six autoregressive image generation models, covering class-conditional and text-to-image generation, validate its effectiveness and generality. LineAR improves ImageNet FID from 2.77 to 2.68 on LlamaGen-XL and COCO FID from 23.85 to 22.86 on Janus-Pro-1B while retaining only 1/6 of the KV cache, and improves DPG on Lumina-mGPT-768 with just 1/8 of the KV cache. It also delivers significant memory and throughput gains, including up to 67.61% memory reduction and a 7.57× speedup on LlamaGen-XL, and 39.66% memory reduction with a 5.62× speedup on Janus-Pro-7B.
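The row-level, attention-guided eviction described in the abstract can be sketched as a small toy routine. This is an illustrative sketch, not the authors' implementation: the parameter names (`keep_older`, `protect_recent_rows`, `row_len`) and the choice of summed attention mass as the per-token importance score are assumptions made for the example.

```python
import numpy as np

def compress_kv_cache(keys, values, attn,
                      keep_older=16, protect_recent_rows=2, row_len=16):
    """Toy row-wise KV cache pruning (illustrative, not LineAR's actual code).

    keys, values : (T, d) arrays caching T previously generated tokens.
    attn         : (q, T) attention weights from the line currently being
                   decoded to the cached tokens (e.g. averaged over heads).

    The most recent `protect_recent_rows` lines (of `row_len` tokens each)
    are always kept; older tokens are ranked by the attention mass they
    receive from the current line, and only the top `keep_older` survive.
    """
    T = keys.shape[0]
    # Cached tokens past this boundary belong to the protected recent lines.
    boundary = max(0, T - protect_recent_rows * row_len)
    # Importance of each older cached token = total attention it receives.
    importance = attn.sum(axis=0)
    # Keep the most-attended older tokens, in their original order.
    topk = np.sort(np.argsort(importance[:boundary])[::-1][:keep_older])
    keep = np.concatenate([topk, np.arange(boundary, T)]).astype(int)
    return keys[keep], values[keep], keep
```

For example, with 96 cached tokens, a row length of 16, two protected rows, and `keep_older=16`, the compressed cache holds 48 tokens: the 32 most recent plus the 16 most-attended older ones.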