🤖 AI Summary
This work addresses the challenge of aligning text-to-image generation with user intent, which often requires multiple rounds of manual trial and error because existing methods (simple prompt rewriting or closed-loop agents driven by hand-crafted rules) do not adapt their actions to the evolving generation process. Framing the task as a state-conditioned sequential decision-making problem, the paper introduces Generation Navigator, a multi-turn agent that dynamically steers the generation trajectory and outputs the next action, and trains it with PRE-GRPO (Peak-Retention-Efficiency Group Relative Policy Optimization), a trajectory-level reinforcement learning objective. PRE-GRPO jointly rewards discovering a high-quality image, retaining that quality across later turns, and using few turns, thereby addressing the credit assignment problem. The reported experiments show substantial improvements across benchmarks, reaching a WISE score of 0.90 and 79.06% reasoning accuracy on T2I-ReasonBench.
📝 Abstract
Despite rapid advances in text-to-image (T2I) generation, faithfully realizing user intent remains challenging, often requiring manual multi-turn trial and error. To automate this process, existing systems rely on either simple prompt rewriting or closed-loop agents driven by hand-crafted rules, rather than learning to adapt actions to the evolving generation process. In this paper, we reformulate image generation as a state-conditioned decision-making problem and propose Generation Navigator, a multi-turn T2I agent that learns to dynamically steer the generation trajectory and output the next action. However, training this agent via reinforcement learning introduces a critical credit assignment challenge. Naively rewarding a trajectory based solely on a single state assigns equal credit to all actions in the rollout, ignores the quality dynamics across turns, and fails to distinguish actions that improve the trajectory from those that degrade it or waste turns without progress. We resolve this with PRE-GRPO (Peak-Retention-Efficiency Group Relative Policy Optimization), a trajectory-level reinforcement learning objective that explicitly rewards discovering a high-quality image (Peak), avoiding subsequent quality degradation across turns (Retention), and minimizing unnecessary turns (Efficiency). Experiments show substantial improvements across benchmarks, reaching a WISE score of 0.90 and 79.06% reasoning accuracy on T2I-ReasonBench.
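To make the Peak, Retention, and Efficiency terms concrete, the following is a minimal sketch of how such a trajectory-level reward could be computed and then turned into group-relative advantages. It is not the paper's formulation: the abstract does not give the exact reward, so the function names (`pre_reward`, `group_relative_advantages`), the weights, the "peak minus final score" retention penalty, and the "fraction of turn budget used" efficiency penalty are all assumptions chosen for illustration. The mean and standard deviation normalization within a group of rollouts follows the usual GRPO recipe.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def pre_reward(
    scores: Sequence[float],
    max_turns: int,
    w_peak: float = 1.0,   # hypothetical weight, not from the paper
    w_ret: float = 0.5,    # hypothetical weight, not from the paper
    w_eff: float = 0.1,    # hypothetical weight, not from the paper
) -> float:
    """Illustrative trajectory-level reward from per-turn image quality scores.

    `scores[t]` is assumed to be a quality score for the image produced at turn t.
    """
    if not scores:
        raise ValueError("a trajectory needs at least one turn")
    if len(scores) > max_turns:
        raise ValueError("trajectory is longer than the turn budget")

    # Peak: the best image quality reached anywhere in the trajectory.
    peak = max(scores)
    # Retention (assumed form): how far the final image fell below the peak.
    retention_penalty = peak - scores[-1]
    # Efficiency (assumed form): fraction of the turn budget that was used.
    efficiency_penalty = len(scores) / max_turns

    return w_peak * peak - w_ret * retention_penalty - w_eff * efficiency_penalty


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within a group of rollouts sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Two hypothetical rollouts for one prompt, with a budget of 4 turns.
    degraded = [0.5, 0.9, 0.7]   # reaches a peak, then gets worse
    stopped = [0.5, 0.9]         # reaches the same peak and stops
    rewards = [pre_reward(degraded, 4), pre_reward(stopped, 4)]
    print(rewards, group_relative_advantages(rewards))
```

Under these assumed terms, a rollout that reaches the same peak but then degrades, or spends extra turns, receives a lower reward than one that stops at the peak. That is the distinction a reward based on a single state cannot make.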