🤖 AI Summary
This work addresses the low fidelity and poor inference efficiency of autoregressive vision generative models in high-resolution (1024×1024) text-to-image synthesis. We propose SimpleAR—a lightweight, purely autoregressive framework with only 0.5B parameters—achieving high-quality generation without architectural modifications. To our knowledge, this is the first systematic demonstration of the strong competitiveness of the pure autoregressive paradigm in high-resolution text-to-image generation. We jointly optimize aesthetic quality and prompt alignment via supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO). Leveraging vLLM for inference acceleration, SimpleAR generates a single 1024×1024 image in just 14 seconds, achieving a GenEval score of 0.59 and a DPG score of 79.66. The complete codebase and training pipeline are fully open-sourced to ensure reproducibility.
📝 Abstract
This work presents SimpleAR, a vanilla autoregressive visual generation framework without complex architecture modifications. Through careful exploration of training and inference optimization, we demonstrate that: 1) with only 0.5B parameters, our model can generate 1024x1024 resolution images with high fidelity, and achieve competitive results on challenging text-to-image benchmarks, e.g., 0.59 on GenEval and 79.66 on DPG; 2) both supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) training can lead to significant improvements in generation aesthetics and prompt alignment; and 3) when optimized with inference acceleration techniques like vLLM, the time for SimpleAR to generate a 1024x1024 image can be reduced to around 14 seconds. By sharing these findings and open-sourcing the code, we hope to reveal the potential of autoregressive visual generation and encourage more participation in this research field. Code is available at https://github.com/wdrink/SimpleAR.
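The "vanilla autoregressive" recipe described above reduces to next-token sampling over a discrete visual codebook: the transformer predicts one image token at a time, conditioned on the prompt and all previously sampled tokens, and a tokenizer decoder then maps the token grid back to pixels. A minimal sketch of that sampling loop, using a toy stand-in model (all names here are illustrative, not SimpleAR's actual API):

```python
import numpy as np

def sample_image_tokens(next_token_logits, vocab_size, num_tokens,
                        temperature=1.0, seed=0):
    """Autoregressively sample a sequence of discrete image tokens.

    next_token_logits: callable mapping the current token prefix
    (list[int]) to logits over the vocabulary -- a stand-in for the
    prompt-conditioned transformer.
    """
    rng = np.random.default_rng(seed)
    tokens = []
    for _ in range(num_tokens):
        logits = np.asarray(next_token_logits(tokens)) / temperature
        # Softmax with max-subtraction for numerical stability.
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(vocab_size, p=probs)))
    return tokens

# Toy "model": uniform logits over a tiny vocabulary. A real model would
# condition on the text prompt and the prefix; real visual codebooks are
# far larger.
vocab_size = 16
toy_model = lambda prefix: np.zeros(vocab_size)

# A 1024x1024 image at 16x spatial downsampling corresponds to a
# 64x64 = 4096-token grid; here we sample just 8 tokens for illustration.
tokens = sample_image_tokens(toy_model, vocab_size, num_tokens=8)
print(tokens)
```

Because every token depends on the full prefix, this loop is inherently sequential, which is why serving-side optimizations such as vLLM's KV-cache management matter so much for the reported 14-second generation time.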