GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping

📅 2025-10-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
GRPO in flow-matching-based diffusion models suffers from an importance-ratio distribution shift, characterized by means below 1 and inconsistent variances across denoising steps, which renders PPO clipping ineffective. Consequently, positive-advantage samples fail to constrain overconfident policy updates, leading to implicit over-optimization: the surrogate reward increases while image fidelity and text-image alignment deteriorate significantly. Method: GRPO-Guard is a dual-mechanism framework that jointly normalizes importance ratios and reweights gradients to enforce a balanced ratio distribution across denoising steps, thereby restoring the efficacy of PPO clipping in suppressing harmful updates and reducing reliance on strong KL regularization. Contribution/Results: Evaluated on SD3.5M, Flux.1-dev, and diverse proxy tasks, GRPO-Guard substantially mitigates over-optimization, enhances training stability, and preserves, or even improves, generation quality and semantic alignment.

📝 Abstract
Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution: its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage: while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality.
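The failure mode the abstract describes can be illustrated numerically: when the importance-ratio distribution is left-shifted (mean below 1), almost no positive-advantage sample ever exceeds the upper clip boundary 1 + ε, so clipping never constrains positive updates. The distribution parameters below are illustrative assumptions, not values measured in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.2  # PPO clipping half-width

# Illustrative log-normal importance ratios whose mean sits below 1,
# mimicking the left-shifted distribution the paper reports.
log_ratios = rng.normal(loc=-0.1, scale=0.05, size=100_000)
ratios = np.exp(log_ratios)

# For positive-advantage samples, PPO clips only when ratio > 1 + eps.
clipped_frac = np.mean(ratios > 1 + eps)
print(f"mean ratio: {ratios.mean():.3f}")            # below 1
print(f"positive updates hitting the clip: {clipped_frac:.4%}")  # ~0
```

With these assumed parameters essentially no sample reaches the upper clip region, so overconfident positive gradients pass through unconstrained, which is exactly the condition under which implicit over-optimization can proceed unchecked.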
Problem

Research questions and friction points this paper is trying to address.

Addresses systematic shift in importance-ratio distribution in GRPO frameworks
Mitigates implicit over-optimization where rewards increase but quality deteriorates
Prevents excessive policy updates from particular timestep regions in flow matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Normalizes importance ratios for balanced clipping
Reweights gradients across noise conditions equally
Regulates clipping to stabilize flow matching optimization
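A minimal sketch of how the two mechanisms above could combine into a "regulated clipping" objective. The exact formulas are not given in this summary, so the per-timestep log-ratio centering and inverse-spread gradient weights below are assumptions chosen to match the described intent, not the paper's definitions.

```python
import numpy as np

def regulated_clip_objective(log_ratios, advantages, timesteps, eps=0.2):
    """Hypothetical sketch of regulated clipping: per-timestep ratio
    normalization plus gradient reweighting, so that PPO clipping bites
    at every denoising step.

    log_ratios: (N,) log importance ratios log(pi_new / pi_old)
    advantages: (N,) group-relative advantages
    timesteps:  (N,) integer denoising-step index per sample
    """
    log_ratios = np.asarray(log_ratios, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    timesteps = np.asarray(timesteps)
    normed = np.empty_like(log_ratios)
    weights = np.empty_like(log_ratios)
    for t in np.unique(timesteps):
        m = timesteps == t
        # Ratio normalization (assumed form): re-center log-ratios so the
        # per-step ratio distribution has geometric mean 1.
        normed[m] = log_ratios[m] - log_ratios[m].mean()
        # Gradient reweighting (assumed form): damp steps whose ratios
        # have unusually large spread.
        weights[m] = 1.0 / (log_ratios[m].std() + 1e-6)
    weights /= weights.mean()  # keep the overall gradient scale unchanged

    r = np.exp(normed)
    unclipped = r * advantages
    clipped = np.clip(r, 1.0 - eps, 1.0 + eps) * advantages
    # Standard PPO pessimistic surrogate over the regulated ratios.
    return np.mean(weights * np.minimum(unclipped, clipped))
```

Because the centering is applied per timestep, a uniform shift of all log-ratios within a step leaves the objective unchanged; in this sketch that is what keeps the clip boundaries meaningful at every denoising step.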