🤖 AI Summary
Autoregressive image generation suffers from memory explosion and throughput bottlenecks because the KV cache grows linearly with decoding length. This paper proposes LineAR, a training-free, progressive KV cache compression method built on a 2D row-wise view of the token sequence. Exploiting the spatial locality of visual attention and inter-row dependencies, LineAR dynamically identifies and prunes low-information tokens, retaining only the critical row-level caches. Its core contribution is a cross-row, attention-guided, training-free compression strategy that is compatible with diverse autoregressive image generation architectures. Evaluated on multiple state-of-the-art models, LineAR reduces GPU memory consumption by up to 67.61% and accelerates inference by up to 7.57×, while improving generation quality on ImageNet and COCO benchmarks. Remarkably, it matches or exceeds the original generation quality using only 1/6–1/8 of the original cache footprint.
📄 Abstract
Autoregressive (AR) visual generation has emerged as a powerful paradigm for image and multimodal synthesis, owing to its scalability and generality. However, existing AR image generation suffers from severe memory bottlenecks: all previously generated visual tokens must be cached during decoding, leading to both high storage requirements and low throughput. In this paper, we introduce LineAR, a novel, training-free, progressive key-value (KV) cache compression pipeline for autoregressive image generation. By fully exploiting the intrinsic characteristics of visual attention, LineAR manages the cache at the line level in a 2D view, preserving visually dependent regions while progressively evicting less-informative tokens that do not affect subsequent line generation, guided by inter-line attention. LineAR enables efficient AR image generation with only a few lines of cache, achieving both memory savings and throughput speedup while maintaining or even improving generation quality. Extensive experiments across six autoregressive image generation models, covering class-conditional and text-to-image generation, validate its effectiveness and generality. LineAR improves ImageNet FID from 2.77 to 2.68 on LlamaGen-XL and COCO FID from 23.85 to 22.86 on Janus-Pro-1B while retaining only 1/6 of the KV cache, and improves DPG on Lumina-mGPT-768 with just 1/8 of the KV cache. It also delivers significant memory and throughput gains, including up to 67.61% memory reduction and a 7.57× speedup on LlamaGen-XL, and 39.66% memory reduction with a 5.62× speedup on Janus-Pro-7B.
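The row-level, attention-guided eviction described in the abstract can be sketched as a small toy routine. This is an illustrative sketch, not the authors' implementation: the parameter names (`keep_older`, `protect_recent_rows`, `row_len`) and the choice of summed attention mass as the per-token importance score are assumptions made for the example.

```python
import numpy as np

def compress_kv_cache(keys, values, attn,
                      keep_older=16, protect_recent_rows=2, row_len=16):
    """Toy row-wise KV cache pruning (illustrative, not LineAR's actual code).

    keys, values : (T, d) arrays caching T previously generated tokens.
    attn         : (q, T) attention weights from the line currently being
                   decoded to the cached tokens (e.g. averaged over heads).

    The most recent `protect_recent_rows` lines (of `row_len` tokens each)
    are always kept; older tokens are ranked by the attention mass they
    receive from the current line, and only the top `keep_older` survive.
    """
    T = keys.shape[0]
    # Cached tokens past this boundary belong to the protected recent lines.
    boundary = max(0, T - protect_recent_rows * row_len)
    # Importance of each older cached token = total attention it receives.
    importance = attn.sum(axis=0)
    # Keep the most-attended older tokens, in their original order.
    topk = np.sort(np.argsort(importance[:boundary])[::-1][:keep_older])
    keep = np.concatenate([topk, np.arange(boundary, T)]).astype(int)
    return keys[keep], values[keep], keep
```

For example, with 96 cached tokens, a row length of 16, two protected rows, and `keep_older=16`, the compressed cache holds 48 tokens: the 32 most recent plus the 16 most-attended older ones.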