🤖 AI Summary
This work addresses the instability of reinforcement learning in large language model post-training, which often stems from noisy supervision signals and poor out-of-distribution generalization. To mitigate these issues, the authors propose DFPO, a framework that extends value modeling from discrete quantile points to a continuous value flow field across time steps. DFPO incorporates trajectory-wise conditional risk control and consistency constraints to enable more precise, state-dependent value estimation. Empirical results show that DFPO significantly outperforms baselines such as PPO and FlowRL on dialogue, mathematical reasoning, and scientific tasks, with improved training stability and more robust generalization under both noisy supervision and out-of-distribution evaluation.
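The paper's implementation is not given here, but the central idea of replacing a set of independently learned quantile scalars with a continuous, state-conditioned value field can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: it hypothesizes a network that maps state features plus a continuous time step `t` and quantile level `tau` to a value, so the same head can be queried at any point of the flow. All names (`ValueFlowField`, `state_dim`, etc.) are placeholders.

```python
import torch
import torch.nn as nn

class ValueFlowField(nn.Module):
    """Illustrative continuous value head (an assumption, not DFPO's code):
    instead of K independent quantile outputs, predict a value for any
    (state, time step, quantile level) triple, so values vary smoothly
    across time steps and quantile levels."""

    def __init__(self, state_dim: int, hidden_dim: int = 128):
        super().__init__()
        # +2 inputs: normalized time step t in [0, 1] and quantile level tau in (0, 1)
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state: torch.Tensor, t: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
        # state: (B, state_dim), t: (B, 1), tau: (B, 1) -> value: (B, 1)
        return self.net(torch.cat([state, t, tau], dim=-1))


# Usage: query the same states at a chosen time step and quantile level.
field = ValueFlowField(state_dim=16)
s = torch.randn(4, 16)
t = torch.full((4, 1), 0.5)    # mid-trajectory time step
tau = torch.full((4, 1), 0.25) # lower-tail quantile level
v = field(s, t, tau)           # (4, 1) state-, time-, and risk-conditioned values
```

A discrete-quantile baseline would instead emit a fixed vector of K values per state; conditioning on a continuous `tau` and `t` is one way to realize a "value flow field" over time steps, which is the contrast the summary draws.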
📝 Abstract
Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-distribution (OOD) generalization, especially in LLM post-training. Recent distributional RL methods improve robustness by modeling values with multiple quantile points, but they still learn each quantile independently as a scalar. This yields coarse-grained value representations that lack fine-grained conditioning on state information and struggle under complex and OOD conditions. We propose DFPO (Distributional Value Flow Policy Optimization with Conditional Risk and Consistency Control), a robust distributional RL framework that models values as continuous flows across time steps. By learning a value flow field instead of isolated quantile predictions, DFPO scales value modeling and captures richer state information for more accurate advantage estimation. To stabilize training under noisy feedback, DFPO further integrates conditional risk control and consistency constraints along value flow trajectories. Experiments on dialogue, math reasoning, and scientific tasks show that DFPO outperforms PPO, FlowRL, and other robust baselines under noisy supervision, achieving improved training stability and generalization.
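The abstract names two stabilizers but does not specify their form. As a rough sketch only, the snippet below pairs a CVaR-style conditional risk measure (mean of the worst α-fraction of sampled values, a common reading of "conditional risk control") with a smoothness penalty between values at adjacent time steps (one plausible "consistency constraint along value flow trajectories"). Both functions and all parameter names are assumptions for exposition; the paper's actual objectives may differ.

```python
import torch

def cvar_risk(values: torch.Tensor, alpha: float = 0.25) -> torch.Tensor:
    """Assumed CVaR-style risk: mean of the worst alpha-fraction of value
    samples. `values` has shape (B, K), K value samples per state."""
    k = max(1, int(alpha * values.shape[-1]))
    worst, _ = torch.topk(values, k, dim=-1, largest=False)  # lowest k values
    return worst.mean(dim=-1)  # (B,) lower-tail conditional value per state

def consistency_penalty(values_t: torch.Tensor, values_t_next: torch.Tensor) -> torch.Tensor:
    """Assumed consistency term: penalize abrupt jumps in the value flow
    between adjacent time steps of a trajectory."""
    return ((values_t_next - values_t) ** 2).mean()

# Usage with dummy value samples along a trajectory:
v_t = torch.randn(8, 32)                    # 32 sampled values at step t for 8 states
v_t1 = v_t + 0.1 * torch.randn_like(v_t)    # values at step t+1
risk = cvar_risk(v_t)                       # risk-sensitive value estimate, shape (8,)
smooth = consistency_penalty(v_t, v_t1)     # scalar penalty added to the training loss
```

Under noisy feedback, a lower-tail risk measure down-weights spuriously optimistic value samples, while the consistency term discourages the flow field from fitting per-step noise, which matches the stability motivation given in the abstract.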