Resurrect Mask AutoRegressive Modeling for Efficient and Scalable Image Generation

📅 2025-07-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Masked autoregressive (MAR) image generation models decode tokens in parallel and are therefore efficient, but their sample quality has traditionally lagged behind that of standard autoregressive (AR) models. To close this gap, this work first compares image tokenizers to select the most effective one, then proposes a Bidirectional LLaMA architecture that replaces causal attention with bidirectional attention and adds 2D rotary positional encoding (2D RoPE). The resulting model, MaskGIL, is scaled from 111M to 1.4B parameters and reaches an FID of 3.71 on ImageNet 256×256 in eight parallel decoding steps, matching state-of-the-art AR models that need 256 steps (32× fewer steps). A 775M-parameter text-driven variant generates images from text at various resolutions, and the framework is further extended to accelerate AR-based generation and to support real-time speech-to-image conversion.

📝 Abstract

AutoRegressive (AR) models have made notable progress in image generation, with Masked AutoRegressive (MAR) models gaining attention for their efficient parallel decoding. However, MAR models have traditionally underperformed when compared to standard AR models. This study refines the MAR architecture to improve image generation quality. We begin by evaluating various image tokenizers to identify the most effective one. Subsequently, we introduce an improved Bidirectional LLaMA architecture by replacing causal attention with bidirectional attention and incorporating 2D RoPE, which together form our advanced model, MaskGIL. Scaled from 111M to 1.4B parameters, MaskGIL achieves an FID score of 3.71, matching state-of-the-art AR models on the ImageNet 256x256 benchmark, while requiring only 8 inference steps compared to the 256 steps of AR models. Furthermore, we develop a text-driven MaskGIL model with 775M parameters for generating images from text at various resolutions. Beyond image generation, MaskGIL extends to accelerate AR-based generation and enable real-time speech-to-image conversion. Our code and models are available at https://github.com/synbol/MaskGIL.
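
The step counts in the abstract can be made concrete with a little arithmetic. An AR model emits one token per step, so 256 tokens take 256 steps, while a masked model fixes many tokens per step. The sketch below assumes a MaskGIT-style cosine masking schedule, which is a common choice for masked decoders but is not confirmed by the abstract as the schedule MaskGIL uses. The function name `tokens_per_step` is hypothetical.

```python
import math


def tokens_per_step(num_tokens: int, num_steps: int) -> list[int]:
    """Count how many tokens are fixed at each parallel decoding step.

    Assumes a cosine schedule: after step t, roughly
    num_tokens * cos(pi/2 * t / num_steps) tokens remain masked.
    """
    counts = []
    masked = num_tokens
    for step in range(1, num_steps + 1):
        if step < num_steps:
            ratio = math.cos(math.pi / 2 * step / num_steps)
            target_masked = math.floor(num_tokens * ratio)
        else:
            # The final step must unmask everything that is left.
            target_masked = 0
        # Reveal at least one token per step, and never go below zero.
        target_masked = max(0, min(target_masked, masked - 1))
        counts.append(masked - target_masked)
        masked = target_masked
    return counts


schedule = tokens_per_step(num_tokens=256, num_steps=8)
print(schedule)                 # tokens fixed at each of the 8 steps
print(sum(schedule))            # total tokens decoded
print(256 // len(schedule))     # ratio of AR steps to masked steps
```

Under this assumed schedule, early steps commit only a few tokens and later steps commit many more, which is why the total step count can drop from 256 to 8. Fewer steps do not translate one-to-one into wall-clock speedup, since each masked step attends over the full token grid.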

Problem

Research questions and friction points this paper is trying to address.

Improving MAR models for better image generation quality

Developing efficient parallel decoding with fewer inference steps

Enabling text-driven and real-time speech-to-image generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Improved MAR with bidirectional attention

Incorporated 2D RoPE in LLaMA

Scaled model from 111M to 1.4B parameters
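
To illustrate what 2D RoPE adds over the 1D rotary encoding in a standard LLaMA block, here is a minimal pure-Python sketch. It assumes one common convention: the first half of the channels is rotated by the token's row index and the second half by its column index. The abstract does not specify MaskGIL's channel layout, so treat the split, the `base` value, and the helper names `rope_1d`, `rope_2d`, and `dot` as illustrative assumptions.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by angles proportional to `pos`."""
    num_pairs = len(vec) // 2
    out = []
    for i in range(num_pairs):
        theta = pos * base ** (-i / num_pairs)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def rope_2d(vec, row, col):
    """Encode row in the first half of the channels, column in the second half.

    Assumes len(vec) is divisible by 4.
    """
    half = len(vec) // 2
    return rope_1d(vec[:half], row) + rope_1d(vec[half:], col)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5, 1.1, -0.4, 0.9, 0.2]
k = [1.0, 0.1, -0.6, 0.8, -0.3, 0.5, 0.4, -1.5]

# Same (row, col) offset between query and key, at two different grid locations.
score_a = dot(rope_2d(q, 2, 3), rope_2d(k, 5, 1))
score_b = dot(rope_2d(q, 12, 13), rope_2d(k, 15, 11))
print(abs(score_a - score_b) < 1e-9)
```

The attention score depends only on the relative row and column offsets between two tokens, not on their absolute positions. That property suits bidirectional attention over an image grid, where every token attends to neighbours in all directions rather than only to earlier tokens in a raster scan.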
