STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address training instability, degradation of pretrained capabilities, and poor generalization when applying GRPO to autoregressive image generation, this paper proposes STAGE, a stable and generalizable GRPO framework. Its core contributions are: (1) a similarity-aware advantage/KL reweighting mechanism that alleviates token-level gradient conflicts and suppresses interference from unnecessary tokens; and (2) an entropy-based reward computed against a reference model, which stabilizes policy entropy dynamics to improve robustness and cross-task generalization. Experiments demonstrate that STAGE significantly improves visual quality and training stability, effectively mitigates reward hacking, and achieves superior zero-shot transfer performance across multiple benchmarks, including text-to-image generation, image editing, and layout-to-image synthesis, without disrupting pretrained knowledge.
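The first contribution, similarity-aware advantage/KL reweighting, can be sketched as follows. This is a minimal illustration of the idea described in the summary, not the paper's implementation: the sigmoid mapping, the temperature `tau`, and the KL coefficient `beta` are all assumptions. Tokens that look alike across the sampled group carry little reward-relevant signal, so their advantage and KL contributions are damped.

```python
import math

def token_weights(similarities, tau=0.5):
    """Map per-token cross-sample similarity scores (0..1) to soft weights.

    High similarity across the group -> low weight (the token is likely
    "unnecessary" and would otherwise inject conflicting gradients).
    `tau` is a hypothetical temperature, not taken from the paper.
    """
    return [1.0 / (1.0 + math.exp(-(0.5 - s) / tau)) for s in similarities]

def reweighted_objective(advantage, kl_per_token, similarities, beta=0.04):
    """Per-token GRPO-style objective with similarity-aware reweighting.

    The scalar group-relative advantage is spread over tokens and scaled
    by each token's weight; the per-token KL penalty toward the reference
    model is scaled by the same weight, so low-signal tokens contribute
    little to either term. `beta` is an illustrative KL coefficient.
    """
    w = token_weights(similarities)
    return sum(wi * (advantage - beta * kl)
               for wi, kl in zip(w, kl_per_token))
```

With this weighting, a token whose similarity across group samples is high (e.g. 0.9) receives a much smaller weight than a distinctive token (e.g. 0.1), which is the intended suppression of interference from unnecessary tokens.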

📝 Abstract
Reinforcement learning has recently been explored to improve text-to-image generation, yet applying existing GRPO algorithms to autoregressive (AR) image models remains challenging. The instability of the training process easily disrupts the pretrained model's capability during long runs, resulting in marginal gains, degraded image quality, and poor generalization. In this work, we revisit GRPO for AR image generation and identify two key issues: contradictory gradients from unnecessary tokens and unstable policy entropy dynamics. To address these, we introduce STAGE, a stable and generalizable framework that leverages two targeted solutions: 1) Advantage/KL reweighting. Similarity-aware reweighting to alleviate conflicting updates; and 2) Entropy reward. An entropy-based reward computed with respect to the reference model to stabilize learning. By alleviating conflicts between tokens and stabilizing training with the entropy reward, we reduce disruption of the pretrained distribution and mitigate reward hacking, which in turn improves generalization and transfer to other benchmarks. Experiments across multiple benchmarks show that STAGE consistently improves visual quality, stability, and cross-task generalization compared to baseline GRPO.
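The second contribution, the entropy-based reward tied to the reference model, can be sketched as follows. This is an illustrative reading of the abstract, not the paper's exact formulation: the absolute-gap penalty and the scale `alpha` are assumptions. The idea is to keep the policy's entropy anchored to the reference model's, so the policy neither collapses toward degenerate outputs nor diffuses during RL fine-tuning.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_reward(policy_probs, ref_probs, alpha=1.0):
    """Hypothetical entropy-based reward term.

    Penalizes the policy's entropy drifting away from the reference
    model's entropy; a matched entropy yields zero penalty, while a
    collapsed (near-deterministic) or overly diffuse policy is pushed
    back. `alpha` is an illustrative scale, not taken from the paper.
    """
    gap = abs(entropy(policy_probs) - entropy(ref_probs))
    return -alpha * gap
```

For a uniform reference distribution, a policy that has collapsed onto a single token receives a strictly negative entropy reward, while a policy matching the reference's entropy receives zero, which is the stabilizing behavior the abstract describes.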
Problem

Research questions and friction points this paper is trying to address.

Stabilizing reinforcement learning for autoregressive image generation
Resolving contradictory gradients from unnecessary tokens
Addressing unstable policy entropy dynamics during training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Advantage/KL reweighting to reduce conflicting token gradients
Entropy-based reward stabilizes policy learning dynamics
Framework mitigates reward hacking and improves generalization