Benchmarking and Evolving Reason-Reflect-Rectify for Reflective Visual Generation

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses a limitation of existing text-to-image (T2I) models: they rely predominantly on single-pass generation and struggle with complex prompts that require iterative refinement. The authors formalize the Reason-Reflect-Rectify (R³) loop, a cyclic framework with explicit reflection and rectification steps, and introduce R³-Bench, a benchmark of over 600 expert-annotated instances for evaluating these capabilities. Evaluation on R³-Bench shows that state-of-the-art models can identify generation errors but fail to produce actionable rectification instructions. To close this gap, they develop R³-Refiner, a two-stage approach that combines Group Relative Policy Optimization (GRPO) with a Hierarchical Reward Mechanism (HRM) to better align rectification with reflective reasoning. On R³-Bench, the method improves the Reflective Verdict Score by 12.0% and the Rectification Score by 9.0%; it can also be integrated with various MLLMs to improve the generation quality of different T2I models on GenEval++ and T2I-CompBench.
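As a rough illustration of the "group relative" part of GRPO mentioned above, the sketch below normalizes the rewards of several sampled responses to the same prompt against the group's own mean and standard deviation. It shows only the generic GRPO advantage computation, not R³-Refiner itself: the function name, the use of the population standard deviation, and the example reward values are assumptions, and the paper's Hierarchical Reward Mechanism is not modeled here.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group.

    Hypothetical helper: advantage_i = (r_i - mean(r)) / (std(r) + eps).
    The population standard deviation is an assumption; some
    implementations use the sample standard deviation instead.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up rewards for four candidate rectification instructions
# sampled for one prompt (1.0 = judged helpful, 0.0 = judged unhelpful).
example_rewards = [1.0, 0.0, 0.0, 1.0]
example_advantages = group_relative_advantages(example_rewards)
```

Responses that score above their group's average receive a positive advantage and those below receive a negative one, so no separate value network is needed to provide a baseline.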

📝 Abstract

Text-to-Image (T2I) models and Unified Multimodal Models (UMMs) have achieved remarkable progress in visual generation. However, their reliance on a single-pass generation paradigm limits their ability to handle complex prompts requiring iterative refinement. To enable multi-round Reflective Visual Generation (RVG), we formalize the Reason-Reflect-Rectify (R³) loop as a core framework and introduce R³-Bench, a benchmark of over 600 expert-annotated instances that quantifies iterative reasoning and rectification capabilities. Evaluation on R³-Bench reveals a critical gap: while state-of-the-art models can identify generation errors, they fail to generate actionable rectification instructions. To bridge this gap, we propose R³-Refiner, a dual-stage framework leveraging Group Relative Policy Optimization (GRPO) and a Hierarchical Reward Mechanism (HRM) to better align rectification with reflective reasoning. Experiments show that R³-Refiner achieves significant improvements on R³-Bench (+12.0% in Reflective Verdict Score, +9.0% in Rectification Score), and can be seamlessly integrated with various MLLMs to enhance the generation quality of different T2I models on GenEval++ and T2I-CompBench. Code is available at https://github.com/xiaomoguhz/R3-Bench.
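The multi-round loop described in the abstract can be pictured with a minimal control-flow sketch. Everything here is a hypothetical stand-in rather than the authors' implementation: the `Reflection` type, the four callables, the `max_rounds` cap, and the toy string "images" are assumptions chosen only to make the Reason, Reflect, and Rectify steps concrete.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Reflection:
    passed: bool       # verdict: does the image satisfy the prompt?
    instruction: str   # rectification instruction (empty when passed)


def reflective_generation(
    prompt: str,
    reason: Callable[[str], str],
    generate: Callable[[str], str],
    reflect: Callable[[str, str], Reflection],
    rectify: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Run a Reason -> (Reflect -> Rectify)* loop; all callables are hypothetical."""
    plan = reason(prompt)             # Reason: expand the prompt into a plan
    image = generate(plan)            # first single-pass generation
    history = [image]
    for _ in range(max_rounds):
        verdict = reflect(prompt, image)              # Reflect: judge the image
        if verdict.passed:
            break
        image = rectify(image, verdict.instruction)   # Rectify: apply the fix
        history.append(image)
    return image, history


# Toy stand-ins: "images" are plain strings so the loop can run anywhere.
final_image, rounds = reflective_generation(
    prompt="two red cubes",
    reason=lambda p: p.lower(),
    generate=lambda plan: plan.replace("two", "one"),   # simulated counting error
    reflect=lambda p, img: Reflection(img == p, "" if img == p else "change 'one' to 'two'"),
    rectify=lambda img, instr: img.replace("one", "two"),
)
```

In this toy run the first "image" miscounts the cubes, the reflect step flags it, and a single rectification round repairs it. The gap reported in the abstract corresponds to the case where `reflect` returns a correct verdict but an instruction that `rectify` cannot act on.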

Problem

Research questions and friction points this paper is trying to address.

Reflective Visual Generation

Iterative Refinement

Rectification

Text-to-Image Generation

Multimodal Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reason-Reflect-Rectify

Reflective Visual Generation

R³-Bench