T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-image (T2I) evaluation relies heavily on costly commercial models and high-quality human or LLM-generated critiques—creating bottlenecks in scalability and accessibility. Method: This paper proposes a lightweight, interpretable automatic evaluation framework. It introduces the first reinforcement learning paradigm for T2I assessment based on Group Relative Policy Optimization (GRPO); designs a continuous reward mechanism to enhance score diversity and training stability; and jointly generates scalar quality scores and natural language rationale chains—requiring only coarse-grained human ratings, without manually written justifications. Contribution/Results: The framework drastically reduces annotation cost and strengthens open-source multimodal large language models’ (MLLMs) evaluation capability. It achieves state-of-the-art performance across three major T2I meta-evaluation benchmarks, significantly outperforming all baselines. Human evaluation alignment and rationale accuracy are both markedly improved.

📝 Abstract
The rapid progress in diffusion-based text-to-image (T2I) generation has created an urgent need for interpretable automatic evaluation methods that can assess the quality of generated images, thereby reducing the human annotation burden. To reduce the prohibitive cost of relying on commercial models for large-scale evaluation, and to improve the reasoning capabilities of open-source models, recent research has explored supervised fine-tuning (SFT) of multimodal large language models (MLLMs) as dedicated T2I evaluators. However, SFT approaches typically rely on high-quality critique datasets, which are either generated by proprietary LLMs, with potential issues of bias and inconsistency, or annotated by humans at high cost, limiting their scalability and generalization. To address these limitations, we propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores, thereby avoiding the need to annotate high-quality interpretable evaluation rationales. Our approach integrates Group Relative Policy Optimization (GRPO) into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains from easily accessible annotated judgment scores or preferences. Furthermore, we introduce a continuous reward formulation that encourages score diversity and provides stable optimization signals, leading to more robust and discriminative evaluation behavior. Experimental results on three established T2I meta-evaluation benchmarks demonstrate that T2I-Eval-R1 achieves significantly higher alignment with human assessments and offers more accurate interpretable score rationales compared to strong baseline methods.
Problem

Research questions and friction points this paper is trying to address.

Interpretable automatic evaluation for text-to-image generation quality
Reducing reliance on costly commercial models for large-scale evaluation
Training MLLMs with coarse-grained scores to avoid high-quality rationale annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning trains MLLMs without fine-grained annotations
GRPO integrates score and reasoning chain generation
Continuous reward formulation enhances evaluation robustness
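The core mechanics described above can be sketched in a few lines. This is a minimal, hypothetical illustration of a GRPO-style update signal with a continuous reward: the reward here is simply the closeness of the model's scalar score to the coarse human rating, and advantages are computed relative to a group of sampled responses. The paper's exact reward formulation and hyperparameters are not given on this page, so both functions below are assumptions for illustration only.

```python
import statistics

def continuous_reward(predicted: float, human: float, max_score: float = 10.0) -> float:
    """Hypothetical continuous reward: 1 minus the normalized absolute
    distance between the model's scalar score and the coarse human rating.
    The paper's actual reward design may differ."""
    return 1.0 - abs(predicted - human) / max_score

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages in the GRPO style: each sampled response's
    reward is normalized by the mean and standard deviation of its group,
    so no learned value model is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All responses scored identically: no relative signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Example: four sampled evaluations of the same image, human rating 7/10.
rewards = [continuous_reward(p, 7.0) for p in (6.0, 7.5, 9.0, 7.0)]
advantages = grpo_advantages(rewards)
```

Because the reward is continuous rather than a binary match/mismatch, nearby scores still receive graded credit, which is the property the authors credit for score diversity and stable optimization.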
Zi-Ao Ma
School of Computer Science and Technology, Beijing Institute of Technology, China
Tian Lan
School of Computer Science and Technology, Beijing Institute of Technology, China
Rong-Cheng Tu
Nanyang Technological University
Image and Video Retrieval, Cross-modal Retrieval, Deep Learning
Shu-Hang Liu
School of Computer Science and Technology, Beijing Institute of Technology, China
Heyan Huang
School of Computer Science and Technology, Beijing Institute of Technology, China
Zhijing Wu
Beijing Institute of Technology
Information Retrieval, Natural Language Processing
Chen Xu
School of Medical Technology, Beijing Institute of Technology, China
Xian-Ling Mao
Beijing Institute of Technology
Web Data Mining, Information Extraction, QA & Dialogue, Topic Modeling, Learning to Hash