$\text{G}^2$RPO: Granular GRPO for Precise Reward in Flow Models

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing GRPO-style online reinforcement learning for flow models suffers from sparse, coarse-grained reward signals, leading to inaccurate preference alignment. To address this, we propose Granular-GRPO, a novel framework built upon stochastic differential equation (SDE)-based flow models that integrates RL deeply into the generative process. Its core innovations are: (1) a singular stochastic sampling strategy that enforces a strong correlation between the injected noise and the resulting reward signal; and (2) a multi-granularity advantage fusion module that enables fine-grained, step-level reward assessment across temporal scales. Extensive experiments demonstrate that Granular-GRPO consistently outperforms baseline methods across diverse domains and under heterogeneous reward models, significantly improving the accuracy, robustness, and generalization of reward estimation, particularly under sparse or multi-scale reward feedback.

📝 Abstract
The integration of online reinforcement learning (RL) into diffusion and flow models has recently emerged as a promising approach for aligning generative models with human preferences. Stochastic sampling via Stochastic Differential Equations (SDE) is employed during the denoising process to generate diverse denoising directions for RL exploration. While existing methods effectively explore potential high-value samples, they suffer from sub-optimal preference alignment due to sparse and narrow reward signals. To address these challenges, we propose a novel Granular-GRPO ($\text{G}^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions in reinforcement learning of flow models. Specifically, a Singular Stochastic Sampling strategy is introduced to support step-wise stochastic exploration while enforcing a high correlation between the reward and the injected noise, thereby facilitating a faithful reward for each SDE perturbation. Concurrently, to eliminate the bias inherent in fixed-granularity denoising, we introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions. Experiments conducted on various reward models, including both in-domain and out-of-domain evaluations, demonstrate that our $\text{G}^2$RPO significantly outperforms existing flow-based GRPO baselines, highlighting its effectiveness and robustness.
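The abstract describes aggregating GRPO-style advantages computed at multiple diffusion granularities. A minimal sketch of that idea follows; the paper's exact normalization and fusion weights are not given here, so the function names (`group_relative_advantage`, `fuse_multi_granularity`) and the uniform weighting are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def group_relative_advantage(rewards):
    """GRPO-style advantage: normalize rewards within a sampling group
    (subtract the group mean, divide by the group standard deviation)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def fuse_multi_granularity(rewards_per_scale, weights=None):
    """Illustrative fusion: compute group-relative advantages at each
    diffusion granularity (e.g. coarse vs. fine denoising schedules)
    and average them. Uniform weights are an assumption."""
    advs = [group_relative_advantage(r) for r in rewards_per_scale]
    if weights is None:
        weights = [1.0 / len(advs)] * len(advs)
    return sum(w * a for w, a in zip(weights, advs))
```

Fusing across scales means a sampling direction is credited only if it looks good under several denoising granularities, which is the bias-reduction argument the abstract makes against fixed-granularity evaluation.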
Problem

Research questions and friction points this paper is trying to address.

Addresses suboptimal preference alignment in flow model reinforcement learning
Improves sparse reward signals through granular stepwise assessment
Eliminates bias in denoising via multi-granularity advantage integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Granular-GRPO framework for precise reward assessments
Singular Stochastic Sampling strategy for step-wise exploration
Multi-Granularity Advantage Integration for robust evaluation
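The Singular Stochastic Sampling strategy, as described, injects SDE noise so that each reward can be attributed to a single perturbation. A toy sketch of that attribution idea, assuming a generic flow sampler: noise is added at exactly one denoising step while the rest of the trajectory stays deterministic. The `denoise` function, its `velocity` argument, and the `sqrt(dt)` noise scaling are hypothetical stand-ins, not the paper's sampler.

```python
import numpy as np

def denoise(x, n_steps, stochastic_step, velocity, noise_scale=0.1, rng=None):
    """Toy flow sampler: deterministic (ODE-like) updates at every step,
    with SDE-style Gaussian noise injected at only one chosen step.
    Restricting noise to a single step ties the final reward tightly
    to that one perturbation, which is the attribution idea above."""
    rng = rng if rng is not None else np.random.default_rng(0)
    dt = 1.0 / n_steps
    for step in range(n_steps):
        x = x + velocity(x, step * dt) * dt            # deterministic drift
        if step == stochastic_step:                    # single SDE injection
            x = x + noise_scale * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x
```

In a GRPO-style loop one would sample a group of trajectories that differ only in the noise drawn at `stochastic_step`, score each final sample with a reward model, and attribute the resulting advantage to that step's perturbation.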