Autoregressive Image Generation with Linear Complexity: A Spatial-Aware Decay Perspective

📅 2025-07-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Autoregressive image generation models predominantly rely on Transformers, which incur quadratic computational complexity, O(N²), in the sequence length N, along with substantial memory overhead. Linear attention reduces the complexity to O(N), but it neglects the intrinsic 2D spatial structure of images, which impairs long-range dependency modeling and degrades generation quality. To address this, we propose LASADGen, an efficient autoregressive image generation framework built on linear attention. Its core innovation is a spatially aware decay mechanism: learnable decay factors are constructed from genuine 2D coordinates to explicitly model pairwise 2D distance dependencies, and these are integrated with flattened-sequence positional encodings to enable selective contextual attention. Evaluated on ImageNet, LASADGen achieves state-of-the-art generation fidelity under linear complexity, outperforming existing linear-attention approaches and offering a strong trade-off between inference speed and perceptual quality.
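
To make the contrast between sequence distance and grid distance concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's formulation: the raster-scan (row-major) flattening, the Manhattan metric, the fixed scalar `gamma`, and the helper names `to_row_col`, `decay_1d`, and `decay_2d` are all hypothetical choices made for this example.

```python
def to_row_col(index: int, width: int) -> tuple[int, int]:
    """Map a raster-scan (row-major) token index to (row, col)."""
    return divmod(index, width)


def decay_1d(i: int, j: int, gamma: float) -> float:
    """Decay computed from the distance along the flattened sequence."""
    return gamma ** abs(i - j)


def decay_2d(i: int, j: int, width: int, gamma: float) -> float:
    """Decay computed from Manhattan distance on the token grid.

    The metric is an assumption for illustration; the source only says
    that decay factors depend on true 2D positions.
    """
    ri, ci = to_row_col(i, width)
    rj, cj = to_row_col(j, width)
    return gamma ** (abs(ri - rj) + abs(ci - cj))


if __name__ == "__main__":
    width, gamma = 16, 0.9  # assumed 16x16 token grid and decay rate
    query, above = 17, 1    # token at (1, 1) and its neighbour directly above, (0, 1)
    # 16 steps apart in the flattened sequence, so the weight is small.
    print(decay_1d(query, above, gamma))
    # 1 step apart on the grid, so the weight stays close to 1.
    print(decay_2d(query, above, width, gamma))
```

In this toy setting, the token directly above the query is heavily down-weighted by the 1D decay even though it is an immediate spatial neighbour, which is the kind of mismatch a spatially aware decay is meant to avoid.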

📝 Abstract
Autoregressive (AR) models have garnered significant attention in image generation for their ability to capture both local and global structures in visual data. However, prevalent AR models rely predominantly on transformer architectures, which suffer from computational complexity that is quadratic in the input sequence length and from substantial memory overhead caused by maintaining key-value caches. Although linear attention mechanisms have reduced this burden in language models, our initial experiments reveal that they significantly degrade image generation quality because they fail to capture critical long-range dependencies in visual data. We propose Linear Attention with Spatial-Aware Decay (LASAD), a novel attention mechanism that explicitly preserves genuine 2D spatial relationships within flattened image sequences by computing position-dependent decay factors from true 2D spatial locations rather than 1D sequence positions. Based on this mechanism, we present LASADGen, an autoregressive image generator that attends selectively to relevant spatial contexts with linear complexity. Experiments on ImageNet show that LASADGen achieves state-of-the-art image generation performance and computational efficiency, bridging the gap between the efficiency of linear attention and the spatial understanding needed for high-quality generation.
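
For readers unfamiliar with why decayed linear attention avoids a growing key-value cache, the sketch below shows the generic recurrent form with a scalar 1D decay, checked against its pairwise equivalent. This is the baseline kind of mechanism the abstract contrasts with, not LASAD itself: the spatial-aware variant is not reproduced here, and the function names, the unnormalized formulation, and the fixed `gamma` are assumptions made for illustration.

```python
def linear_attention_recurrent(qs, ks, vs, gamma):
    """Causal linear attention with scalar decay, linear in sequence length.

    The state is a d_k x d_v matrix whose size does not depend on the
    number of tokens, so no per-token key-value cache is needed.
    """
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        # S_t = gamma * S_{t-1} + k_t v_t^T
        for a in range(d_k):
            for b in range(d_v):
                state[a][b] = gamma * state[a][b] + k[a] * v[b]
        # o_t = q_t^T S_t
        outputs.append(
            [sum(q[a] * state[a][b] for a in range(d_k)) for b in range(d_v)]
        )
    return outputs


def linear_attention_quadratic(qs, ks, vs, gamma):
    """Pairwise reference: o_t = sum over j <= t of gamma^(t-j) (q_t . k_j) v_j."""
    d_v = len(vs[0])
    outputs = []
    for t, q in enumerate(qs):
        out = [0.0] * d_v
        for j in range(t + 1):
            score = sum(x * y for x, y in zip(q, ks[j]))
            weight = gamma ** (t - j) * score
            for b in range(d_v):
                out[b] += weight * vs[j][b]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
    ks = [[1.0, 1.0], [0.0, 1.0], [2.0, 0.0]]
    vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(linear_attention_recurrent(qs, ks, vs, 0.5))
    print(linear_attention_quadratic(qs, ks, vs, 0.5))
```

Both functions produce the same outputs up to floating-point rounding; the recurrent form simply folds the decayed history into one fixed-size state. Note that the decay exponent here is the 1D offset `t - j`, which is exactly what the abstract proposes replacing with a quantity derived from 2D positions.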
Problem

Research questions and friction points this paper is trying to address.

Reducing quadratic complexity in autoregressive image models
Preserving spatial relationships in linear attention mechanisms
Improving image generation quality with linear computational cost
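
As a rough sense of scale for the first point, the sketch below counts how attention work and cached state grow with the number of image tokens. The token grid sizes (16x16 and 32x32) and the head dimension of 64 are assumed values chosen for illustration, not figures from the paper, and the helper names are hypothetical.

```python
def causal_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by causal softmax attention over the sequence."""
    return num_tokens * (num_tokens + 1) // 2


def kv_cache_entries(num_tokens: int, head_dim: int) -> int:
    """Numbers cached per head and layer: one key and one value vector per token."""
    return 2 * num_tokens * head_dim


def linear_state_entries(head_dim: int) -> int:
    """Numbers kept per head and layer by a d x d linear-attention state."""
    return head_dim * head_dim


if __name__ == "__main__":
    head_dim = 64  # assumed head dimension
    for side in (16, 32):  # assumed token grids: 16x16 and 32x32
        n = side * side
        print(
            n,
            causal_pairs(n),
            kv_cache_entries(n, head_dim),
            linear_state_entries(head_dim),
        )
```

Under these assumptions, going from 256 to 1024 tokens multiplies the number of scored pairs by roughly 16 and the key-value cache by 4, while the linear-attention state stays the same size.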
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear Attention with Spatial-Aware Decay (LASAD)
Preserves 2D spatial relationships in sequences
Achieves linear complexity and high efficiency
Authors

Yuxin Mao, Northwestern Polytechnical University
Zhen Qin, TapTap
Jinxing Zhou, Hefei University of Technology
Hui Deng, Northwestern Polytechnical University
Xuyang Shen, MiniMax | ANU (Multimodal Machine Learning)
Bin Fan, Northwestern Polytechnical University
Jing Zhang, Australian National University
Yiran Zhong, PhD, Australian National University (LLM, Self-supervised Learning, Visual Geometry Learning, Natural Language Processing, Multimodal)
Yuchao Dai, Northwestern Polytechnical University