FlowRL: Matching Reward Distributions for LLM Reasoning

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM reinforcement learning methods (e.g., PPO, GRPO) over-optimize dominant reward signals, suppressing infrequent yet valid reasoning paths and thereby harming path diversity and generalization. This work proposes FlowRL, which replaces scalar reward maximization with distributional alignment: a flow-balancing mechanism trains the policy to match the full reward distribution. Its core innovations are (i) transforming scalar rewards into a normalized target distribution via a learnable partition function, and (ii) minimizing the reverse KL divergence between the policy and this target, implemented as a flow-balanced optimization that explicitly promotes diverse exploration. Evaluated on mathematical and code reasoning tasks, FlowRL achieves average improvements of 10.0% over GRPO and 5.1% over PPO on math benchmarks and performs consistently better on code reasoning, outperforming baselines across multiple benchmarks. By preserving heterogeneous, high-value reasoning trajectories, the method significantly enhances reasoning robustness and generalization.
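Read as stated, the summary corresponds to an objective of roughly the following shape. This is a hedged reconstruction rather than the paper's exact formulation; the inverse temperature $\beta$ and the symbols $Z_{\phi}$ and $\pi_{\theta}$ are assumed notation.

$$
p_{\phi}(\tau \mid x) = \frac{\exp\big(\beta\, r(x, \tau)\big)}{Z_{\phi}(x)},
\qquad
\min_{\theta}\; D_{\mathrm{KL}}\!\left(\pi_{\theta}(\tau \mid x) \,\middle\|\, p_{\phi}(\tau \mid x)\right),
$$

where $\pi_{\theta}$ is the policy over reasoning trajectories $\tau$ for prompt $x$, $r$ is the scalar reward, and $Z_{\phi}(x)$ is the learnable partition function normalizing the reward-tilted target.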

📝 Abstract
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
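As a toy illustration of how a flow-balanced, distribution-matching update could be implemented, the sketch below uses a trajectory-balance style squared residual with a learnable log-partition parameter. It is an assumption-laden sketch, not the authors' code; the function name flow_matching_loss and the parameters log_z and beta are hypothetical.

```python
# Hypothetical sketch, not the authors' implementation: one way to train a
# policy to match a reward-tilted target p(tau) ∝ exp(beta * r(tau)) / Z,
# using a squared flow-balance residual and a learnable log-partition term.
import torch

def flow_matching_loss(logprobs: torch.Tensor,
                       rewards: torch.Tensor,
                       log_z: torch.Tensor,
                       beta: float = 1.0) -> torch.Tensor:
    # logprobs: (B,) summed log-probabilities of sampled trajectories under the policy
    # rewards:  (B,) scalar rewards for those trajectories
    # log_z:    learnable scalar standing in for log Z(x)
    residual = logprobs + log_z - beta * rewards
    return (residual ** 2).mean()

# Toy usage with random tensors standing in for real rollouts.
logprobs = torch.randn(8, requires_grad=True)  # would come from the LLM policy
rewards = torch.rand(8)                        # would come from the reward/verifier signal
log_z = torch.zeros(1, requires_grad=True)     # learnable partition estimate
loss = flow_matching_loss(logprobs, rewards, log_z)
loss.backward()                                # gradients flow to both policy and log_z
```

Driving this residual to zero enforces $\log \pi_\theta(\tau) \approx \beta\, r(\tau) - \log Z$, i.e. $\pi_\theta(\tau) \propto \exp(\beta\, r(\tau))$, which is the same fixed point targeted by the reverse-KL objective described in the abstract, though the two losses differ in their gradients.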
Problem

Research questions and friction points this paper is trying to address.

Matching reward distributions in LLM reinforcement learning
Addressing over-optimization of dominant reward signals
Promoting diverse exploration in reasoning trajectories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Matching reward distributions via flow balancing
Minimizing reverse KL divergence for policy optimization
Promoting diverse exploration in reasoning trajectories
👥 Authors

Xuekai Zhu (Shanghai Jiao Tong University): Synthetic Data, Reasoning, Language Model
Daixuan Cheng (Gaoling School of AI, Renmin University of China): LLM Pre-Training, Domain Adaptation, Reasoning
Dinghuai Zhang (Microsoft Research)
Hengli Li (Institute for Artificial Intelligence, Peking University): Machine Learning, Natural Language Processing
Kaiyan Zhang (Tsinghua University): Foundation Model, Collective Intelligence, Scientific Intelligence
Che Jiang (Tsinghua University)
Youbang Sun (Assistant Researcher, Tsinghua University; Northeastern University; Texas A&M University): Distributed Optimization, Multi-Agent RL, Riemannian Optimization, Federated Learning
Ermo Hua (Tsinghua University): Physics-driven Foundation Model
Yuxin Zuo (Tsinghua University)
Xingtai Lv (Tsinghua University): Large Language Model, Natural Language Processing
Qizheng Zhang (Stanford University)
Lin Chen (Shanghai Jiao Tong University)
Fanghao Shao (Shanghai Jiao Tong University)
Bo Xue (Shanghai Jiao Tong University)
Yunchong Song (Ph.D. student, Shanghai Jiao Tong University): Machine Learning
Zhenjie Yang (Tsinghua University): Networking
Ganqu Cui (Shanghai AI Lab): LLM Alignment, Reinforcement Learning
Ning Ding (Tsinghua University, Shanghai AI Laboratory)
Jianfeng Gao (Microsoft Research)
Xiaodong Liu (Microsoft Research)
Bowen Zhou (Tsinghua University, Shanghai AI Laboratory)
Hongyuan Mei (Google DeepMind, TTIC, JHU, UChicago): Reasoning, Large Language Models, Natural Language Understanding, Machine Learning
Zhouhan Lin (Shanghai Jiao Tong University, Shanghai AI Laboratory)