AI Summary
Masked autoregressive (MAR) image generation models decode efficiently in parallel, but they have traditionally trailed standard autoregressive (AR) models in sample quality. To close this gap, this work first selects an effective image tokenizer, then proposes a Bidirectional LLaMA architecture that replaces causal attention with bidirectional attention and adds 2D rotary positional encoding (2D RoPE), and finally scales the model from 111M to 1.4B parameters. The resulting MaskGIL model reaches an FID of 3.71 on ImageNet 256×256 in just 8 parallel decoding steps, matching state-of-the-art AR models while using 32× fewer decoding steps than the 256 that AR decoding requires. A text-driven variant with 775M parameters generates images from text at multiple resolutions, and the approach also extends to accelerating AR-based generation and to real-time speech-to-image conversion. The key architectural change is the combination of bidirectional attention with 2D RoPE.
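To make the 2D RoPE idea concrete, here is a minimal sketch in plain Python. It assumes one common formulation: the channels of a query or key vector are split in half, the first half is rotated by the token's row index and the second half by its column index. The helper names (`rope_1d`, `rope_2d`, `dot`), the frequency base of 10000, and the toy vectors are illustrative assumptions, not details taken from the MaskGIL code.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) channel pairs of `vec` by angles that scale with `pos`."""
    pairs = len(vec) // 2
    out = []
    for i in range(pairs):
        # Standard RoPE frequencies: base^(-2i/d), with d = len(vec) = 2 * pairs.
        theta = pos * base ** (-i / pairs)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def rope_2d(vec, row, col):
    """Assumed 2D variant: first half of the channels encodes the row, second half the column.

    Requires len(vec) to be divisible by 4.
    """
    mid = len(vec) // 2
    return rope_1d(vec[:mid], row) + rope_1d(vec[mid:], col)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query/key vectors (hypothetical values, head dimension 8).
q = [1.0, 0.5, -0.3, 2.0, 0.7, -1.2, 0.4, 0.9]
k = [0.2, -0.8, 1.5, 0.1, -0.6, 0.3, 1.1, -0.4]

# Attention scores depend only on the relative (row, col) offset between tokens:
# shifting both tokens by the same amount leaves the score unchanged.
score_a = dot(rope_2d(q, 2, 3), rope_2d(k, 5, 1))
score_b = dot(rope_2d(q, 12, 13), rope_2d(k, 15, 11))
print(abs(score_a - score_b) < 1e-9)
```

Because the encoding is relative in both axes, the same weights can in principle be applied to token grids of different sizes, which fits the multi-resolution use described above; whether MaskGIL relies on exactly this property is not stated in the summary.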
Abstract
AutoRegressive (AR) models have made notable progress in image generation, with Masked AutoRegressive (MAR) models gaining attention for their efficient parallel decoding. However, MAR models have traditionally underperformed when compared to standard AR models. This study refines the MAR architecture to improve image generation quality. We begin by evaluating various image tokenizers to identify the most effective one. Subsequently, we introduce an improved Bidirectional LLaMA architecture by replacing causal attention with bidirectional attention and incorporating 2D RoPE, which together form our advanced model, MaskGIL. Scaled from 111M to 1.4B parameters, MaskGIL achieves an FID score of 3.71, matching state-of-the-art AR models on the ImageNet 256×256 benchmark, while requiring only 8 inference steps compared to the 256 steps of AR models. Furthermore, we develop a text-driven MaskGIL model with 775M parameters for generating images from text at various resolutions. Beyond image generation, MaskGIL extends to accelerate AR-based generation and enable real-time speech-to-image conversion. Our code and models are available at https://github.com/synbol/MaskGIL.
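The step counts in the abstract can be illustrated with a small sketch of how a masked model might spread 256 tokens over 8 parallel decoding steps. It assumes a MaskGIT-style cosine masking schedule and a 256-token image (implied by the 256 AR steps, one token per step); the abstract does not say which schedule MaskGIL uses, and the function name `cosine_unmask_schedule` is hypothetical.

```python
import math


def cosine_unmask_schedule(num_tokens, num_steps):
    """Return how many tokens are revealed at each parallel decoding step.

    Assumed cosine schedule: after step t, the fraction of tokens still
    masked is cos(pi/2 * t / num_steps).
    """
    revealed = []
    masked = num_tokens  # everything starts masked
    for t in range(1, num_steps + 1):
        ratio = math.cos(math.pi / 2 * t / num_steps)
        # Force the final step to reveal whatever is left.
        still_masked = 0 if t == num_steps else math.floor(num_tokens * ratio)
        revealed.append(masked - still_masked)
        masked = still_masked
    return revealed


schedule = cosine_unmask_schedule(num_tokens=256, num_steps=8)
print(schedule)               # few tokens early, many tokens late
print(sum(schedule))          # every token is revealed exactly once
print(256 // len(schedule))   # reduction in sequential steps versus one-token-per-step AR
```

Under these assumptions the 8 steps cover all 256 tokens, and the ratio of 256 AR steps to 8 masked steps gives a 32-fold reduction in sequential decoding steps. That is a count of steps, not a measured wall-clock speedup.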