ReasonGen-R1: CoT for Autoregressive Image Generation Models through SFT and RL

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the weak reasoning capability of autoregressive image generation models by deeply integrating Chain-of-Thought (CoT) reasoning with reinforcement learning (RL). Methodologically: (1) We construct the first multimodal reasoning corpus for this setting—pairing textual layouts, styles, and scene descriptions with corresponding images—and use supervised fine-tuning to endow the model with explicit, interpretable text-based reasoning; (2) We apply Group Relative Policy Optimization (GRPO), an RL algorithm that uses a pretrained vision-language model as a reward estimator, enabling efficient and stable policy optimization. Experiments demonstrate state-of-the-art performance across three major benchmarks—GenEval, DPG, and the T2I benchmark—with significant improvements in image plausibility, cross-element consistency, and fine-grained detail fidelity. Our approach establishes a new paradigm for controllable and interpretable image generation.

📝 Abstract
Although chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in NLP, their integration into generative vision models remains underexplored. We introduce ReasonGen-R1, a two-stage framework that first imbues an autoregressive image generator with explicit text-based "thinking" skills via supervised fine-tuning on a newly generated reasoning dataset of written rationales, and then refines its outputs using Group Relative Policy Optimization. To enable the model to reason through text before generating images, we automatically generate and release a corpus of model-crafted rationales paired with visual prompts, enabling controlled planning of object layouts, styles, and scene compositions. Our GRPO algorithm uses reward signals from a pretrained vision-language model to assess overall visual quality, optimizing the policy in each update. Evaluations on GenEval, DPG, and the T2I benchmark demonstrate that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models. More: aka.ms/reasongen.
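The core of GRPO is a group-relative advantage: for each prompt, a group of images is sampled, each is scored by the reward model (here, a pretrained vision-language model), and each sample's advantage is its reward normalized against the group's mean and standard deviation, so no learned critic is needed. A minimal sketch of that normalization step, with purely illustrative reward values (this is not the paper's implementation):

```python
def grpo_advantages(rewards):
    """Group-relative advantages: normalize each reward against the
    mean and (population) std of its sampled group, removing the
    need for a learned value/critic model."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against all-identical rewards
    return [(r - mean) / std for r in rewards]

# Hypothetical VLM reward scores for a group of images
# sampled for the same text prompt.
rewards = [0.2, 0.5, 0.8, 0.5]
advantages = grpo_advantages(rewards)
```

Above-average samples get positive advantages and below-average ones negative, so the policy update pushes the generator toward the group's better images for that prompt.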
Problem

Research questions and friction points this paper is trying to address.

Integrating CoT reasoning into autoregressive image generation models
Enhancing image generation via text-based reasoning and RL refinement
Automating rationale generation for controlled visual scene planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage autoregressive image generation with reasoning
Supervised fine-tuning on model-crafted rationale dataset
Group Relative Policy Optimization for output refinement