Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

📅 2025-08-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing text-to-image (T2I) reinforcement learning approaches rely on pointwise reward models (RMs), making them vulnerable to reward hacking: minor score discrepancies are amplified after normalization, causing training instability and reward hijacking. Moreover, mainstream evaluation benchmarks operate at a coarse granularity, hindering fine-grained assessment. To address these issues, we propose Pref-GRPO, which replaces pointwise scoring with pairwise preference modeling, uses normalized win-rate rewards, and employs an intra-group comparison strategy. Additionally, we introduce UniGenBench, a fine-grained, multi-dimensional unified evaluation benchmark built upon multimodal large language models (MLLMs) and comprising 600 diverse prompts. Pref-GRPO significantly improves training stability and interpretability while effectively mitigating reward hacking. Empirical results demonstrate substantial gains in both image quality discrimination accuracy and model ranking capability compared to prior methods.

๐Ÿ“ Abstract
Recent advancements highlight the importance of GRPO-based reinforcement learning methods and benchmarking in enhancing text-to-image (T2I) generation. However, current methods that use pointwise reward models (RMs) to score generated images are susceptible to reward hacking. We reveal that this happens when minimal score differences between images are amplified after normalization, creating illusory advantages that drive the model to over-optimize for trivial gains, ultimately destabilizing the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images are pairwise compared within each group using a preference RM, and the win rate is used as the reward signal. Extensive experiments demonstrate that Pref-GRPO differentiates subtle image quality differences, providing more stable advantages and mitigating reward hacking. Additionally, existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. To solve this, we introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary and 27 sub-criteria, leveraging MLLMs for benchmark construction and evaluation. Our benchmarks uncover the strengths and weaknesses of both open- and closed-source T2I models and validate the effectiveness of Pref-GRPO.
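The win-rate reward described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `prefer(a, b)` is a hypothetical stand-in for the pairwise preference RM, and each image's reward is simply the fraction of its intra-group comparisons that it wins.

```python
import itertools

def win_rate_rewards(images, prefer):
    """Compute win-rate rewards for a group of images.

    `prefer(a, b)` stands in for a pairwise preference reward model:
    it returns True if image `a` is preferred over image `b`.
    Each image is compared against every other image in the group,
    and its reward is the fraction of those comparisons it wins.
    """
    n = len(images)
    wins = [0] * n
    for i, j in itertools.combinations(range(n), 2):
        if prefer(images[i], images[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Each image takes part in n - 1 pairwise comparisons.
    return [w / (n - 1) for w in wins]
```

Because win rates are bounded in [0, 1] and depend only on preference orderings within the group, small absolute score differences cannot be blown up by normalization, which is the instability the paper attributes to pointwise RMs.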
Problem

Research questions and friction points this paper is trying to address.

Addresses reward hacking in text-to-image reinforcement learning methods
Proposes pairwise preference reward model to stabilize image generation
Introduces comprehensive benchmark for evaluating text-to-image models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pairwise preference reward model for stable training
Win rate as reward signal to prevent hacking
Unified benchmark with multi-criteria MLLM evaluation