AI Summary
Existing reinforcement learning (RL) research for text-to-image (T2I) generation focuses predominantly on diffusion or autoregressive models, overlooking mask-based generative models, an efficient and promising paradigm. This work pioneers the integration of RL into mask-based T2I generation. We propose Mask-GRPO, a novel framework that formalizes iterative unmasking as a multi-step sequential decision-making process. Crucially, we adapt Group Relative Policy Optimization (GRPO) to mask modeling by redefining state transition probabilities, removing the KL-divergence constraint, applying latent-space dimensionality reduction for efficiency, and introducing a low-quality sample filtering mechanism. Evaluated on standard T2I benchmarks, Mask-GRPO achieves significant improvements in both image fidelity and human-preference alignment, outperforming existing RL-based and supervised baselines. The implementation is publicly available.
Abstract
Reinforcement learning (RL) has garnered increasing attention in text-to-image (T2I) generation. However, most existing RL approaches are tailored to either diffusion models or autoregressive models, overlooking an important alternative: masked generative models. In this work, we propose Mask-GRPO, the first method to incorporate Group Relative Policy Optimization (GRPO)-based RL into this overlooked paradigm. Our core insight is to redefine the transition probability, departing from current approaches, and to formulate the unmasking process as a multi-step decision-making problem. To further enhance our method, we explore several useful strategies, including removing the KL constraint, applying the reduction strategy, and filtering out low-quality samples. Using Mask-GRPO, we improve the base model Show-o, achieving substantial gains on standard T2I benchmarks and in preference alignment, outperforming existing state-of-the-art approaches. The code is available at https://github.com/xingzhejun/Mask-GRPO.
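The abstract does not include pseudocode, but the group-relative scoring at the heart of GRPO can be illustrated with a minimal sketch: for each prompt, a group of images is sampled and scored by a reward model, and each sample's advantage is its reward normalized by the group's mean and standard deviation. The function name and the example rewards below are ours for illustration, not from the paper.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each rollout's reward against the
    mean and std of its own group (one group per prompt). Positive values
    mark above-average samples whose unmasking trajectories are reinforced."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon guards a zero-variance group

# Hypothetical example: four images sampled for one prompt, scored in [0, 1].
adv = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Because advantages are computed relative to the group, no learned value function is needed; the best-scoring sample in a group always receives the largest positive advantage, and the advantages of each group sum to zero.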