STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address training instability, degradation of pretrained capabilities, and poor generalization when applying GRPO to autoregressive image generation, this paper proposes STAGE, a stable and generalizable GRPO framework. Its core contributions are: (1) a similarity-aware advantage/KL reweighting mechanism that alleviates token-level gradient conflicts and suppresses interference from unnecessary tokens; and (2) an entropy-based reward computed against a reference model, which stabilizes policy entropy dynamics to improve robustness and cross-task generalization. Experiments demonstrate that STAGE significantly improves visual quality and training stability, effectively mitigates reward hacking, and achieves superior zero-shot transfer performance across multiple benchmarks, including text-to-image generation, image editing, and layout-to-image synthesis, without disrupting pretrained knowledge.
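The first contribution, similarity-aware advantage/KL reweighting, can be sketched as follows. This is a minimal illustration of the idea described in the summary, not the paper's implementation: the sigmoid mapping, the temperature `tau`, and the KL coefficient `beta` are all assumptions. Tokens that look alike across the sampled group carry little reward-relevant signal, so their advantage and KL contributions are damped.

```python
import math

def token_weights(similarities, tau=0.5):
    """Map per-token cross-sample similarity scores (0..1) to soft weights.

    High similarity across the group -> low weight (the token is likely
    "unnecessary" and would otherwise inject conflicting gradients).
    `tau` is a hypothetical temperature, not taken from the paper.
    """
    return [1.0 / (1.0 + math.exp(-(0.5 - s) / tau)) for s in similarities]

def reweighted_objective(advantage, kl_per_token, similarities, beta=0.04):
    """Per-token GRPO-style objective with similarity-aware reweighting.

    The scalar group-relative advantage is spread over tokens and scaled
    by each token's weight; the per-token KL penalty toward the reference
    model is scaled by the same weight, so low-signal tokens contribute
    little to either term. `beta` is an illustrative KL coefficient.
    """
    w = token_weights(similarities)
    return sum(wi * (advantage - beta * kl)
               for wi, kl in zip(w, kl_per_token))
```

With this weighting, a token whose similarity across group samples is high (e.g. 0.9) receives a much smaller weight than a distinctive token (e.g. 0.1), which is the intended suppression of interference from unnecessary tokens.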

📝 Abstract
Reinforcement learning has recently been explored to improve text-to-image generation, yet applying existing GRPO algorithms to autoregressive (AR) image models remains challenging. The instability of the training process easily disrupts the pretrained model's capability during long runs, resulting in marginal gains, degraded image quality, and poor generalization. In this work, we revisit GRPO for AR image generation and identify two key issues: contradictory gradients from unnecessary tokens and unstable policy entropy dynamics. To address these, we introduce STAGE, a stable and generalizable framework that leverages two targeted solutions: 1) Advantage/KL reweighting. Similarity-aware reweighting to alleviate conflicting updates; and 2) Entropy reward. An entropy-based reward computed with respect to the reference model to stabilize learning. By alleviating conflicts between tokens and stabilizing training with the entropy reward, we reduce disruption of the pretrained distribution and mitigate reward hacking, which in turn improves generalization and transfer to other benchmarks. Experiments across multiple benchmarks show that STAGE consistently improves visual quality, stability, and cross-task generalization compared to baseline GRPO.
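The second contribution, the entropy-based reward tied to the reference model, can be sketched as follows. This is an illustrative reading of the abstract, not the paper's exact formulation: the absolute-gap penalty and the scale `alpha` are assumptions. The idea is to keep the policy's entropy anchored to the reference model's, so the policy neither collapses toward degenerate outputs nor diffuses during RL fine-tuning.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_reward(policy_probs, ref_probs, alpha=1.0):
    """Hypothetical entropy-based reward term.

    Penalizes the policy's entropy drifting away from the reference
    model's entropy; a matched entropy yields zero penalty, while a
    collapsed (near-deterministic) or overly diffuse policy is pushed
    back. `alpha` is an illustrative scale, not taken from the paper.
    """
    gap = abs(entropy(policy_probs) - entropy(ref_probs))
    return -alpha * gap
```

For a uniform reference distribution, a policy that has collapsed onto a single token receives a strictly negative entropy reward, while a policy matching the reference's entropy receives zero, which is the stabilizing behavior the abstract describes.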
Problem

Research questions and friction points this paper is trying to address.

Stabilizing reinforcement learning for autoregressive image generation
Resolving contradictory gradients from unnecessary tokens
Addressing unstable policy entropy dynamics during training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Advantage/KL reweighting to reduce conflicting token gradients
Entropy-based reward stabilizes policy learning dynamics
Framework mitigates reward hacking and improves generalization