Make It Efficient: Dynamic Sparse Attention for Autoregressive Image Generation

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the excessive KV cache memory consumption and high inference latency caused by long sequences in autoregressive text-to-image generation, this paper proposes Adaptive Dynamic Sparse Attention (ADSA), a training-free method. ADSA dynamically identifies semantically critical historical tokens versus locally redundant ones by jointly modeling semantic importance, spatial layout, and texture dependencies; it further introduces a context-aware dynamic KV cache update mechanism to enable efficient pruning and cache management. The method preserves generation quality while reducing GPU memory footprint by approximately 50% during inference, significantly improving throughput and hardware resource utilization. Its core innovation lies in being the first to incorporate multi-granularity visual priors—derived from the generation process itself—into sparse attention decisions, thereby enabling semantic-driven, lightweight autoregressive inference.
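The summary describes ADSA as keeping two kinds of historical tokens: a recent local window (for texture consistency) and a small set of semantically important distant tokens (for global coherence). The paper's exact scoring and selection rule is not given here, so the following is only an illustrative sketch under assumed choices: `adsa_keep_indices` is a hypothetical helper, and accumulated attention mass is used as a stand-in proxy for semantic importance.

```python
import numpy as np

def adsa_keep_indices(attn_scores, pos, window=8, top_k=4):
    """Illustrative sketch (not the paper's exact algorithm): pick which
    historical KV entries to keep when generating the token at `pos`.

    attn_scores: (pos,) accumulated attention mass each past token has
                 received so far -- an assumed proxy for semantic importance.
    window:      number of most recent tokens kept for local texture.
    top_k:       number of high-importance distant tokens kept for semantics.
    """
    # Local window: the most recent tokens preserve texture continuity.
    local = set(range(max(0, pos - window), pos))
    # Global picks: the highest-scoring tokens outside the local window.
    distant = [i for i in np.argsort(attn_scores)[::-1] if i not in local]
    semantic = set(distant[:top_k])
    # Attention (and the KV cache) is then restricted to this index set.
    return sorted(local | semantic)
```

Because the kept set is bounded by `window + top_k` rather than the full sequence length, attention cost per step stays roughly constant as generation proceeds.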

📝 Abstract
Autoregressive conditional image generation models have emerged as a dominant paradigm in text-to-image synthesis. These methods typically convert images into one-dimensional token sequences and leverage the self-attention mechanism, which has achieved remarkable success in natural language processing, to capture long-range dependencies, model global context, and ensure semantic coherence. However, excessively long contexts during inference lead to significant memory overhead caused by the KV-cache, as well as computational delays. To alleviate these challenges, we systematically analyze how global semantics, spatial layouts, and fine-grained textures are formed during inference, and propose a novel training-free context optimization method called Adaptive Dynamic Sparse Attention (ADSA). Conceptually, ADSA dynamically identifies historical tokens crucial for maintaining local texture consistency and those essential for ensuring global semantic coherence, thereby efficiently streamlining attention computation. Additionally, we introduce a dynamic KV-cache update mechanism tailored for ADSA, reducing GPU memory consumption during inference by approximately 50%. Extensive qualitative and quantitative experiments demonstrate the effectiveness and superiority of our approach in terms of both generation quality and resource efficiency.
Problem

Research questions and friction points this paper is trying to address.

Reduce memory overhead from KV-cache in autoregressive image generation
Optimize attention computation for long-context inference delays
Maintain texture and semantic coherence while streamlining attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic sparse attention for efficient computation
Training-free context optimization method ADSA
Dynamic KV-cache reduces memory by 50%
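The last bullet, the dynamic KV-cache update, can be pictured as evicting pruned tokens from the cache so memory scales with the kept token set rather than the full sequence. The paper's actual cache layout is not described on this page; `prune_kv_cache` below is a hypothetical helper showing only the general idea with NumPy fancy indexing.

```python
import numpy as np

def prune_kv_cache(k_cache, v_cache, keep_idx):
    """Hypothetical sketch: drop evicted entries from the key/value caches.

    k_cache, v_cache: (seq_len, head_dim) arrays of cached keys/values.
    keep_idx:         indices of historical tokens retained by the
                      sparse-attention selection step.
    Returns compacted caches containing only the kept entries.
    """
    keep = np.asarray(keep_idx)
    return k_cache[keep], v_cache[keep]
```

Keeping roughly half of the entries in this way would halve cache memory, consistent with the ~50% reduction the bullet claims.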
Xunzhi Xiang
Nanjing University

Qi Fan
Nanjing University, School of Intelligent Science and Technology