🤖 AI Summary
Existing LLM reinforcement learning methods (e.g., PPO, GRPO) over-optimize dominant reward signals, suppressing infrequent yet effective reasoning paths and thereby harming path diversity and generalization. This work proposes FlowRL, which abandons scalar reward maximization in favor of distributional alignment via a flow-balancing mechanism that matches the full reward distribution. Its core innovations are: (i) mapping scalar rewards to a normalized target distribution using a learnable partition function; and (ii) minimizing the reverse KL divergence between the policy and this target distribution, which explicitly promotes diverse exploration. Evaluated on mathematical and code reasoning tasks, FlowRL achieves average improvements of 10.0% over GRPO and 5.1% over PPO, consistently outperforming baselines across multiple benchmarks. By preserving heterogeneous, high-value reasoning trajectories, the method significantly enhances reasoning robustness and generalization.
📝 Abstract
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
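To make the flow-balancing idea concrete, here is a minimal, hypothetical sketch of what such an objective could look like. It assumes a trajectory-balance-style squared residual (common in GFlowNet-inspired methods): the policy's trajectory log-probability plus a learnable log-partition estimate should match the scaled reward, so the target distribution proportional to exp(beta * reward) is matched rather than the reward maximized. The function name `flow_balance_loss`, the scalar `log_z`, and the temperature `beta` are illustrative assumptions, not the paper's exact formulation.

```python
def flow_balance_loss(logp_trajectories, rewards, log_z, beta=1.0):
    """Illustrative flow-balance objective (an assumption, not FlowRL's
    exact loss).

    logp_trajectories: summed token log-probs of each sampled trajectory
                       under the current policy.
    rewards:           scalar reward for each trajectory.
    log_z:             learnable log-partition estimate that normalizes
                       exp(beta * reward) into a target distribution.

    Each residual measures how far the policy is from assigning a
    trajectory probability proportional to exp(beta * reward); the mean
    squared residual is driven to zero, aligning the whole distribution
    instead of pushing all mass onto the single highest-reward path.
    """
    residuals = [
        log_z + logp - beta * r
        for logp, r in zip(logp_trajectories, rewards)
    ]
    return sum(d * d for d in residuals) / len(residuals)


# Toy usage: two sampled trajectories with their log-probs and rewards.
loss = flow_balance_loss(
    logp_trajectories=[-2.0, -3.5],
    rewards=[1.0, 0.5],
    log_z=1.0,
)
```

In practice `log_z` would be a learnable parameter (or a small network conditioned on the prompt) trained jointly with the policy; because the residual is squared, gradients flow both ways, pulling under-sampled high-reward trajectories up rather than only amplifying the dominant mode.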