🤖 AI Summary
To balance generation quality against inference efficiency in autoregressive (AR) image generation, this paper proposes AiM, an AR image generator built on the Mamba architecture and offered at several scales. AiM treats an image as a one-dimensional token sequence, bypassing the multi-directional scanning and 2D structural modifications used by earlier Mamba-based vision models; it applies the native Mamba state-space model to next-token prediction, augmented with hierarchical quantization tokenization and a lightweight visual adaptation module. Its core contribution is adapting Mamba to pure AR image generation with only minimal changes, preserving its linear-complexity long-sequence modeling. On ImageNet1K at 256×256 resolution, the best AiM model (the family spans 148M to 1.3B parameters) achieves an FID of 2.21, surpassing prior AR models of comparable parameter count and remaining competitive with diffusion-based methods, with 2× to 10× faster inference.
📝 Abstract
We introduce AiM, an autoregressive (AR) image generative model based on the Mamba architecture. AiM employs Mamba, a novel state-space model characterized by its exceptional performance for long-sequence modeling with linear time complexity, to replace the Transformers commonly used in AR image generation models, aiming to achieve both superior generation quality and faster inference. Unlike existing methods that adapt Mamba to handle two-dimensional signals via multi-directional scanning, AiM directly uses the next-token prediction paradigm for autoregressive image generation. This approach avoids the extensive modifications otherwise needed for Mamba to learn 2D spatial representations. By implementing straightforward yet strategically targeted modifications for visual generative tasks, we preserve Mamba's core structure, fully exploiting its efficient long-sequence modeling capabilities and scalability. We provide AiM models at various scales, with parameter counts ranging from 148M to 1.3B. On the ImageNet1K 256×256 benchmark, our best AiM model achieves an FID of 2.21, surpassing all existing AR models of comparable parameter counts and demonstrating significant competitiveness against diffusion models, with 2 to 10 times faster inference speed. Code is available at https://github.com/hp-l33/AiM
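The next-token formulation described in the abstract can be illustrated with a small sketch. This is not the AiM implementation: the helper names (`raster_flatten`, `next_token_pairs`, `linear_scan`), the start token `bos`, and the fixed scalar parameters `a`, `b`, `c` are all hypothetical, and real Mamba layers use input-dependent (selective) parameters rather than constants. The sketch only shows how a 2D grid of discrete image tokens might be read in a single raster order, how shifted input/target pairs are formed, and why a linear recurrence costs constant work per step.

```python
from typing import List, Tuple


def raster_flatten(grid: List[List[int]]) -> List[int]:
    """Flatten a 2D grid of token ids row by row (one scan direction only)."""
    return [tok for row in grid for tok in row]


def next_token_pairs(seq: List[int], bos: int) -> Tuple[List[int], List[int]]:
    """Build next-token prediction pairs.

    Inputs are the sequence shifted right behind a start token (assumed here);
    targets are the original sequence, so position t predicts token t.
    """
    inputs = [bos] + seq[:-1]
    targets = list(seq)
    return inputs, targets


def linear_scan(xs: List[float], a: float, b: float, c: float) -> List[float]:
    """Toy scalar state-space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    The state h has fixed size, so each step is O(1) and a length-N
    sequence costs O(N) in total.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


# Hypothetical 2x2 grid of image token ids.
grid = [[5, 1], [7, 3]]
seq = raster_flatten(grid)
inputs, targets = next_token_pairs(seq, bos=0)
```

Under these assumptions, the model never needs a second scan direction: spatial structure has to be learned from the single flattened order, which is the design choice the abstract contrasts with multi-directional scanning.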