🤖 AI Summary
This work addresses the low fidelity and poor inference efficiency of autoregressive vision generative models in high-resolution (1024×1024) text-to-image synthesis. We propose SimpleAR—a lightweight, purely autoregressive framework with only 0.5B parameters—achieving high-quality generation without architectural modifications. To our knowledge, this is the first systematic demonstration of the strong competitiveness of the pure autoregressive paradigm in high-resolution text-to-image generation. We jointly optimize aesthetic quality and prompt alignment via supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO). Leveraging vLLM for inference acceleration, SimpleAR generates a single 1024×1024 image in just 14 seconds, achieving a GenEval score of 0.59 and a DPG score of 79.66. The complete codebase and training pipeline are fully open-sourced to ensure reproducibility.
📝 Abstract
This work presents SimpleAR, a vanilla autoregressive visual generation framework without complex architecture modifications. Through careful exploration of training and inference optimization, we demonstrate that: 1) with only 0.5B parameters, our model can generate 1024x1024 resolution images with high fidelity, and achieve competitive results on challenging text-to-image benchmarks, e.g., 0.59 on GenEval and 79.66 on DPG; 2) both supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) training can lead to significant improvements in generation aesthetics and prompt alignment; and 3) when optimized with inference acceleration techniques like vLLM, the time for SimpleAR to generate a 1024x1024 image can be reduced to around 14 seconds. By sharing these findings and open-sourcing the code, we hope to reveal the potential of autoregressive visual generation and encourage more participation in this research field. Code is available at https://github.com/wdrink/SimpleAR.
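The "vanilla autoregressive" recipe described above reduces to next-token sampling over a discrete visual codebook: the transformer predicts one image token at a time, conditioned on the prompt and all previously sampled tokens, and a tokenizer decoder then maps the token grid back to pixels. A minimal sketch of that sampling loop, using a toy stand-in model (all names here are illustrative, not SimpleAR's actual API):

```python
import numpy as np

def sample_image_tokens(next_token_logits, vocab_size, num_tokens,
                        temperature=1.0, seed=0):
    """Autoregressively sample a sequence of discrete image tokens.

    next_token_logits: callable mapping the current token prefix
    (list[int]) to logits over the vocabulary -- a stand-in for the
    prompt-conditioned transformer.
    """
    rng = np.random.default_rng(seed)
    tokens = []
    for _ in range(num_tokens):
        logits = np.asarray(next_token_logits(tokens)) / temperature
        # Softmax with max-subtraction for numerical stability.
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(vocab_size, p=probs)))
    return tokens

# Toy "model": uniform logits over a tiny vocabulary. A real model would
# condition on the text prompt and the prefix; real visual codebooks are
# far larger.
vocab_size = 16
toy_model = lambda prefix: np.zeros(vocab_size)

# A 1024x1024 image at 16x spatial downsampling corresponds to a
# 64x64 = 4096-token grid; here we sample just 8 tokens for illustration.
tokens = sample_image_tokens(toy_model, vocab_size, num_tokens=8)
print(tokens)
```

Because every token depends on the full prefix, this loop is inherently sequential, which is why serving-side optimizations such as vLLM's KV-cache management matter so much for the reported 14-second generation time.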