PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models

📅 2025-09-30

🤖 AI Summary
Existing policy-gradient alignment methods for text-to-image (T2I) generation suffer from imbalanced credit assignment across diffusion timesteps. The imbalance arises from the mathematical structure of stochastic samplers and leads to training instability, high gradient variance, and model collapse. This work identifies the structural origin of the issue and proposes Proportionate Credit Policy Optimization (PCPO), a framework that reformulates the policy gradient objective and introduces a timestep-aware reweighting mechanism so that credit is assigned proportionally across diffusion steps. PCPO reduces gradient variance, accelerates convergence, and outperforms existing policy gradient baselines, including the state-of-the-art DanceGRPO, on metrics such as FID and CLIP Score. It also mitigates model collapse in recursive training settings.

📝 Abstract
While reinforcement learning has advanced the alignment of text-to-image (T2I) models, state-of-the-art policy gradient methods are still hampered by training instability and high variance, hindering convergence speed and compromising image quality. Our analysis identifies a key cause of this instability: disproportionate credit assignment, in which the mathematical structure of the generative sampler produces volatile and non-proportional feedback across timesteps. To address this, we introduce Proportionate Credit Policy Optimization (PCPO), a framework that enforces proportional credit assignment through a stable objective reformulation and a principled reweighting of timesteps. This correction stabilizes the training process, leading to significantly accelerated convergence and superior image quality. The improvement in quality is a direct result of mitigating model collapse, a common failure mode in recursive training. PCPO substantially outperforms existing policy gradient baselines on all fronts, including the state-of-the-art DanceGRPO.
Problem

Research questions and friction points this paper is trying to address.

Addresses training instability in text-to-image model alignment
Solves disproportionate credit assignment in generative samplers
Mitigates model collapse during recursive training processes
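
As a rough illustration of how a sampler's structure can produce uneven feedback across timesteps, consider a Gaussian denoising step with mean `mean` and noise scale `sigma`. The gradient of its log-probability with respect to the mean is `(x - mean) / sigma**2`, so steps with small noise scales yield much larger gradients. The sketch below only demonstrates this general property; the noise schedule `sigmas` and the dimension `dim` are hypothetical values, and this is not necessarily the analysis carried out in the paper.

```python
import numpy as np

def grad_logprob_wrt_mean(x, mean, sigma):
    """Gradient of log N(x; mean, sigma^2 I) with respect to the mean."""
    return (x - mean) / sigma**2

rng = np.random.default_rng(0)
sigmas = np.array([1.0, 0.5, 0.1, 0.02])  # hypothetical per-timestep noise scales
dim = 1024                                # hypothetical latent dimension

grad_norms = []
for sigma in sigmas:
    mean = np.zeros(dim)
    # Sample the next state from this step's Gaussian policy.
    x = mean + sigma * rng.standard_normal(dim)
    grad_norms.append(np.linalg.norm(grad_logprob_wrt_mean(x, mean, sigma)))
grad_norms = np.array(grad_norms)

# The gradient norm grows roughly like sqrt(dim) / sigma as sigma shrinks.
for sigma, g in zip(sigmas, grad_norms):
    print(f"sigma={sigma:<5} grad norm={g:.1f}")
```

With the same reward signal applied to every step, the low-noise steps in this toy setup would dominate the update, which is one way non-proportional credit can arise.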
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proportionate credit assignment for stable training
Objective reformulation and principled timestep reweighting
Mitigating model collapse through proportional feedback correction
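
The idea of timestep reweighting can be sketched as a policy gradient surrogate in which each denoising step's term is scaled by a normalized weight. This is a minimal sketch under stated assumptions, not the paper's objective: the function `reweighted_pg_loss`, the weights `step_weights`, and the group-normalized advantages are all hypothetical, and the summary above does not specify how PCPO actually chooses its weights.

```python
import numpy as np

def reweighted_pg_loss(log_probs, advantages, step_weights):
    """REINFORCE-style surrogate with per-timestep weights (hypothetical form).

    log_probs:    (batch, T) log-probability of each denoising step
    advantages:   (batch,)   one advantage per generated sample
    step_weights: (T,)       nonnegative weights, normalized to sum to 1
    """
    step_weights = np.asarray(step_weights, dtype=float)
    w = step_weights / step_weights.sum()
    # Each step receives the sample's advantage, scaled by its weight.
    surrogate = -advantages[:, None] * log_probs
    return float((surrogate * w[None, :]).sum(axis=1).mean())

rng = np.random.default_rng(0)
batch, T = 4, 10
log_probs = rng.normal(size=(batch, T))   # stand-in for real per-step log-probs
rewards = rng.normal(size=batch)          # stand-in for reward model scores
# Assumption: advantages normalized within the group of samples.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

uniform_loss = reweighted_pg_loss(log_probs, advantages, np.ones(T))
print(uniform_loss)
```

With uniform weights the surrogate reduces to a plain average over timesteps; a non-uniform `step_weights` shifts how much each step contributes without changing the overall scale, because the weights are normalized.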