🤖 AI Summary
This work addresses the high computational cost of autoregressive vision models in high-resolution text-to-image synthesis by systematically analyzing design choices in linear attention mechanisms and constructing an efficient, high-fidelity autoregressive generative architecture. The key innovations include replacing Softmax with a more scalable division-based normalization, incorporating depthwise convolutions to enhance local modeling, and introducing a KV-gating mechanism for flexible memory management. The resulting model relies entirely on linear attention and achieves strong performance: a 2.18 FID on ImageNet with about 1.4B parameters and a 0.74 overall score on GenEval with about 1.5B parameters, while reducing per-module FLOPs by about 61% compared to conventional Softmax-based attention.
📝 Abstract
Autoregressive models with continuous tokens form a promising paradigm for visual generation, especially for text-to-image (T2I) synthesis, but they suffer from high computational cost. We study how to design compute-efficient linear attention within this framework. Specifically, we conduct a systematic empirical analysis of scaling behavior with respect to parameter counts under different design choices, focusing on (1) normalization paradigms in linear attention (division-based vs. subtraction-based) and (2) depthwise convolution for locality augmentation. Our results show that although subtraction-based normalization is effective for image classification, division-based normalization scales better for linear generative transformers. In addition, incorporating convolution for locality modeling plays a crucial role in autoregressive generation, consistent with findings in diffusion models. We further extend gating mechanisms, commonly used in causal linear attention, to the bidirectional setting and propose a KV gate. By introducing data-independent learnable parameters to the key and value states, the KV gate assigns token-wise memory weights, enabling flexible memory management similar to forget gates in language models. Based on these findings, we present LINA, a simple and compute-efficient T2I model built entirely on linear attention, capable of generating high-fidelity 1024×1024 images from user instructions. LINA achieves competitive performance on both class-conditional and T2I benchmarks, obtaining 2.18 FID on ImageNet (about 1.4B parameters) and 0.74 on GenEval (about 1.5B parameters). A single linear attention module reduces FLOPs by about 61% compared to softmax attention. Code and models are available at: https://github.com/techmonsterwang/LINA.
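To make the two central mechanisms concrete, below is a minimal NumPy sketch of bidirectional linear attention with division-based normalization and a data-independent KV gate. This is an illustrative reconstruction, not the paper's implementation: the ELU+1 feature map, the sigmoid parameterization of the gate, and applying one shared gate to both keys and values are assumptions for the sketch, as the abstract does not specify the exact formulation.

```python
import numpy as np

def phi(x):
    # Positive feature map ELU(x) + 1, a common kernel choice for linear
    # attention (assumption: the paper's exact feature map is not given here).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_kv_gate(Q, K, V, gate_logits):
    """Bidirectional linear attention with division-based normalization
    and a data-independent KV gate (illustrative sketch).

    Q, K: (T, d) query/key states; V: (T, d_v) value states;
    gate_logits: (T,) learnable, data-independent parameters.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))   # token-wise memory weights in (0, 1)
    Qf = phi(Q)
    Kf = phi(K) * g[:, None]                 # gate the key states ...
    Vg = V * g[:, None]                      # ... and the value states
    kv = Kf.T @ Vg                           # (d, d_v) shared memory, O(T * d * d_v)
    z = Kf.sum(axis=0)                       # (d,) accumulator for the normalizer
    # Division-based normalization: divide each query's readout by its
    # kernel mass instead of applying a Softmax over T tokens.
    return (Qf @ kv) / (Qf @ z)[:, None]
```

Because `kv` and `z` are fixed-size summaries of the whole sequence, the cost is linear in the token count `T`, which is where the FLOP savings over Softmax attention come from; a token whose gate saturates near zero is effectively dropped from the shared memory.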