Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the susceptibility of existing reinforcement learning post-training methods to biases in imperfect reward functions, which leads to reward hacking and makes it hard to separate genuine signals from noise. From an information-geometric perspective, the authors propose Super-Linear Advantage Shaping (SLAS), which extends the Fisher-Rao information metric with advantage-dependent weighting to give the local policy space a non-linear geometric structure. This relaxes constraints along high-advantage directions to amplify informative updates and tightens them in low-advantage regions to suppress illusory gradients, while batch-level normalization stabilizes training under varying reward scales. Integrated within the GRPO framework, SLAS consistently outperforms the DanceGRPO baseline across multiple backbones and benchmarks, with faster training dynamics, improved out-of-domain performance on GenEval and UniGenBench++, greater robustness to model scaling, and preserved semantic and compositional fidelity.

📝 Abstract

Recently, post-training methods based on reinforcement learning, with a particular focus on Group Relative Policy Optimization (GRPO), have emerged as a robust paradigm for further advancing text-to-image (T2I) models. However, these methods are often prone to reward hacking, wherein models exploit biases in imperfect reward functions rather than yielding genuine performance gains. In this work, we identify that prompt-level normalization can lead to miscalibration, and that directly removing the prompt-level standard deviation term yields an optimal policy ascent direction that is linear in the advantage but still limits the separation of genuine signals from noise. To mitigate these issues, we propose Super-Linear Advantage Shaping (SLAS) by revisiting the functional update from an information geometry perspective. By extending the Fisher-Rao information metric with advantage-dependent weighting, SLAS introduces a non-linear geometric structure that reshapes the local policy space. This design relaxes constraints along high-advantage directions to amplify informative updates, while tightening those in low-advantage regions to suppress illusory gradients. In addition, batch-level normalization is applied to stabilize training under varying reward scales. Extensive evaluations demonstrate that SLAS consistently surpasses the DanceGRPO baseline across multiple backbones and benchmarks. In particular, it yields faster training dynamics, improved out-of-domain performance on GenEval and UniGenBench++, and enhanced robustness to model scaling, while mitigating reward hacking and preserving semantic and compositional fidelity in generations.
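
The normalization issue described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the abstract does not give the SLAS update rule, so the sign-preserving power transform below (`superlinear_shape`, with a hypothetical `power` parameter) is only an assumed stand-in for "super-linear in the advantage", and all function names and reward values are made up for illustration. The sketch compares per-prompt standard deviation normalization, plain mean-centering, and a super-linear shaping followed by a single batch-level rescaling.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, use_std=True, eps=1e-8):
    """Group-relative advantages for the samples of one prompt."""
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not use_std:
        return centered  # linear in the raw reward gap
    sigma = pstdev(rewards)
    return [c / (sigma + eps) for c in centered]


def superlinear_shape(advantages, power=2.0):
    """ASSUMPTION: illustrative shaping sign(a) * |a|**power, not the paper's formula."""
    return [(1.0 if a >= 0 else -1.0) * abs(a) ** power for a in advantages]


def batch_normalize(groups, eps=1e-8):
    """Rescale every advantage by one standard deviation computed over the whole batch."""
    flat = [a for group in groups for a in group]
    sigma = pstdev(flat)
    return [[a / (sigma + eps) for a in group] for group in groups]


def peak(values):
    """Largest absolute value in a list."""
    return max(abs(v) for v in values)


# Hypothetical rewards: prompt 0 differs only by noise, prompt 1 has a real spread.
rewards_by_prompt = [[0.50, 0.51, 0.49, 0.50], [0.1, 0.9, 0.2, 0.8]]

per_prompt = [grpo_advantages(r, use_std=True) for r in rewards_by_prompt]
linear = [grpo_advantages(r, use_std=False) for r in rewards_by_prompt]
shaped = batch_normalize([superlinear_shape(g) for g in linear])

for name, adv in [("per-prompt std", per_prompt), ("linear", linear), ("shaped", shaped)]:
    print(name, "noisy/real peak ratio:", peak(adv[0]) / peak(adv[1]))
```

In this toy batch, per-prompt standard deviation normalization puts the noise-only prompt on the same scale as the informative one, mean-centering alone keeps its advantages small, and the assumed power shaping shrinks them further relative to the high-advantage samples. Because the batch-level rescaling uses one shared constant, it changes the overall scale without altering that ratio.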

Problem

Research questions and friction points this paper is trying to address.

reward hacking

text-to-image models

reinforcement learning

policy optimization

advantage shaping

Innovation

Methods, ideas, or system contributions that make the work stand out.

Super-Linear Advantage Shaping

Reinforcement Learning Post-Training

Information Geometry