ReasonGen-R1: CoT for Autoregressive Image Generation Models through SFT and RL

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the weak reasoning capability of autoregressive image generation models by deeply integrating Chain-of-Thought (CoT) reasoning with reinforcement learning (RL). Methodologically: (1) We construct the first multimodal reasoning corpus for this setting—pairing textual layouts, styles, and scene descriptions with corresponding images—and use supervised fine-tuning to endow the model with explicit, interpretable text-based reasoning; (2) We apply Group Relative Policy Optimization (GRPO), an RL algorithm that uses a pretrained vision-language model as a reward estimator, enabling efficient and stable policy optimization. Experiments demonstrate state-of-the-art performance across three major benchmarks—GenEval, DPG, and the T2I benchmark—with significant improvements in image plausibility, cross-element consistency, and fine-grained detail fidelity. Our approach establishes a new paradigm for controllable and interpretable image generation.

📝 Abstract
Although chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in NLP, their integration into generative vision models remains underexplored. We introduce ReasonGen-R1, a two-stage framework that first imbues an autoregressive image generator with explicit text-based "thinking" skills via supervised fine-tuning on a newly generated reasoning dataset of written rationales, and then refines its outputs using Group Relative Policy Optimization. To enable the model to reason through text before generating images, we automatically generate and release a corpus of model-crafted rationales paired with visual prompts, enabling controlled planning of object layouts, styles, and scene compositions. Our GRPO algorithm uses reward signals from a pretrained vision-language model to assess overall visual quality, optimizing the policy in each update. Evaluations on GenEval, DPG, and the T2I benchmark demonstrate that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models. More: aka.ms/reasongen.
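The core of GRPO is a group-relative advantage: for each prompt, a group of images is sampled, each is scored by the reward model (here, a pretrained vision-language model), and each sample's advantage is its reward normalized against the group's mean and standard deviation, so no learned critic is needed. A minimal sketch of that normalization step, with purely illustrative reward values (this is not the paper's implementation):

```python
def grpo_advantages(rewards):
    """Group-relative advantages: normalize each reward against the
    mean and (population) std of its sampled group, removing the
    need for a learned value/critic model."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against all-identical rewards
    return [(r - mean) / std for r in rewards]

# Hypothetical VLM reward scores for a group of images
# sampled for the same text prompt.
rewards = [0.2, 0.5, 0.8, 0.5]
advantages = grpo_advantages(rewards)
```

Above-average samples get positive advantages and below-average ones negative, so the policy update pushes the generator toward the group's better images for that prompt.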
Problem

Research questions and friction points this paper is trying to address.

Integrating CoT reasoning into autoregressive image generation models
Enhancing image generation via text-based reasoning and RL refinement
Automating rationale generation for controlled visual scene planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage autoregressive image generation with reasoning
Supervised fine-tuning on model-crafted rationale dataset
Group Relative Policy Optimization for output refinement