E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models

📅 2026-01-01
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing multi-step denoising reinforcement learning methods for human preference alignment, which suffer from sparse and ambiguous reward signals and from insufficient exploration, because low-entropy steps reduce sample discriminability. To overcome these challenges, the authors propose E-GRPO, a novel approach featuring an entropy-aware mechanism that dynamically identifies and merges consecutive low-entropy steps into high-entropy sampling units. High-entropy steps are sampled with stochastic differential equations (SDEs) to enhance exploration, while ordinary differential equations (ODEs) are used for the remaining steps to improve computational efficiency. In addition, the method introduces a group-wise shared multi-step normalized advantage estimator combined with Group Relative Policy Optimization (GRPO) to mitigate reward ambiguity. Experiments show that E-GRPO significantly improves both training efficiency and alignment performance across diverse reward settings, validating the efficacy of entropy-driven strategies in flow-based reinforcement learning.
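As a rough illustration of the step-merging idea in the summary above, the sketch below groups consecutive low-entropy denoising steps into a single stochastic (SDE) sampling unit and leaves the remaining steps deterministic (ODE), following the wording of the abstract. The per-step entropy estimates, the threshold `tau`, and the plan format are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (not the authors' code): merge runs of consecutive low-entropy
# denoising steps into one high-entropy SDE sampling unit; other steps use ODE.
def build_sampling_plan(step_entropies, tau):
    """Return a list of ("sde", steps) / ("ode", steps) units over denoising steps."""
    plan, run = [], []
    for t, h in enumerate(step_entropies):
        if h < tau:                        # low-entropy step: queue it for merging
            run.append(t)
        else:
            if run:
                plan.append(("sde", run))  # merged low-entropy run -> one high-entropy SDE unit
                run = []
            plan.append(("ode", [t]))      # remaining step kept deterministic (ODE)
    if run:
        plan.append(("sde", run))
    return plan

# Example: entropy estimates for 6 denoising steps, threshold 0.5
print(build_sampling_plan([0.9, 0.3, 0.2, 0.8, 0.1, 0.1], tau=0.5))
# -> [('ode', [0]), ('sde', [1, 2]), ('ode', [3]), ('sde', [4, 5])]
```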

📝 Abstract
Recent reinforcement learning methods have enhanced flow matching models for human preference alignment. While stochastic sampling enables the exploration of denoising directions, existing methods that optimize over multiple denoising steps suffer from sparse and ambiguous reward signals. We observe that high-entropy steps enable more efficient and effective exploration, while low-entropy steps produce indistinguishable roll-outs. To this end, we propose E-GRPO, an entropy-aware Group Relative Policy Optimization that increases the entropy of SDE sampling steps. Since integrating stochastic differential equations over multiple steps leads to ambiguous reward signals due to accumulated stochasticity, we merge consecutive low-entropy steps into one high-entropy step for SDE sampling, while applying ODE sampling to the other steps. Building upon this, we introduce a multi-step group-normalized advantage, which computes group-relative advantages within samples sharing the same consolidated SDE denoising step. Experimental results under different reward settings demonstrate the effectiveness of our method.
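To make the group-relative advantage concrete, here is a minimal GRPO-style sketch, assuming rewards are already grouped by rollouts that share the same consolidated SDE denoising step. The tensor shapes, the grouping convention, and the epsilon term are illustrative choices rather than the paper's exact estimator.

```python
import torch

def multi_step_group_normalized_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each rollout's reward within its own group.

    rewards: shape [num_groups, group_size]; each row holds rollouts that
    share the same consolidated SDE denoising step (an assumed convention).
    """
    mean = rewards.mean(dim=1, keepdim=True)   # per-group mean reward
    std = rewards.std(dim=1, keepdim=True)     # per-group reward spread
    return (rewards - mean) / (std + eps)      # positive if better than the group average

# Example: 2 groups of 4 rollouts; each group shares one consolidated SDE step
rewards = torch.tensor([[0.2, 0.5, 0.9, 0.4],
                        [0.7, 0.1, 0.3, 0.3]])
print(multi_step_group_normalized_advantage(rewards))
```

Normalizing within a group that shares the same consolidated SDE step keeps the reward differences attributable to the stochasticity injected at that single step, which is the intuition behind the estimator described above.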
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
flow models
reward sparsity
denoising steps
preference alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

entropy-aware reinforcement learning
flow matching
SDE/ODE sampling
group relative policy optimization
multi-step advantage normalization
Shengjun Zhang, Tsinghua University
Zhang Zhang, Tsinghua University
Chensheng Dai, Tsinghua University
Yueqi Duan, Tsinghua University