DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training

📅 2025-12-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address policy instability and poor generalization in large language model (LLM) post-training caused by noisy or incomplete reinforcement learning (RL) supervision, this paper proposes a distributional, risk-aware RL framework. Methodologically, it introduces Conditional Value-at-Risk (CVaR) theory into token-level distributional value modeling for the first time and designs an asymmetric risk regularization: contracting the lower tail of the value distribution to suppress noise-induced deviations while preserving the upper tail to retain exploratory diversity. This balances robustness against over-conservatism and thereby improves policy generalization. Experiments on multi-turn dialogue, mathematical reasoning, and scientific question answering show that DVPO consistently outperforms PPO, GRPO, and robust Bellman-based PPO under noisy supervision, achieving superior stability and cross-task transferability.
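The lower-tail risk measure at the core of this design is CVaR. As a rough illustration of how a lower-tail CVaR can be read off a quantile-parameterized token-level value distribution (a common parameterization in distributional RL; the paper's exact estimator may differ, and the function name and `alpha` below are illustrative):

```python
# Minimal sketch: lower-tail CVaR over a quantile-parameterized per-token
# value distribution. A generic distributional-RL construction, not
# necessarily DVPO's exact estimator; `alpha` is an assumed hyperparameter.
import torch

def lower_cvar(quantiles: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """CVaR_alpha: expected value over the worst alpha-fraction of outcomes.

    quantiles: tensor of shape (..., num_quantiles) holding quantile
               estimates of the per-token return distribution.
    """
    q_sorted, _ = torch.sort(quantiles, dim=-1)      # ascending values
    k = max(1, int(alpha * q_sorted.shape[-1]))      # lower-tail width
    return q_sorted[..., :k].mean(dim=-1)            # average of the tail
```

Contracting how far this quantity sits below the distribution's center is one natural way to realize the "lower-tail contraction" the summary describes.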

📝 Abstract
Reinforcement learning (RL) has shown strong performance in LLM post-training, but real-world deployment often involves noisy or incomplete supervision. In such settings, complex and unreliable supervision signals can destabilize training and harm generalization. While existing approaches such as worst-case optimization (e.g., RFQI, CQL) and mean-based methods (e.g., PPO, GRPO) can improve stability, they often overlook generalization and may produce overly conservative policies, leading to uneven performance across diverse real-world scenarios. To this end, we introduce DVPO (Distributional Value Modeling with Risk-aware Policy Optimization), a new RL framework that combines conditional risk theory with distributional value modeling to better balance robustness and generalization. DVPO learns token-level value distributions to provide fine-grained supervision and applies an asymmetric risk regularization to shape the distribution tails: it contracts the lower tail to dampen noisy negative deviations while expanding the upper tail to preserve exploratory diversity. Across extensive experiments and analyses on multi-turn dialogue, math reasoning, and scientific QA, DVPO consistently outperforms PPO, GRPO, and robust Bellman-based PPO under noisy supervision, showing its potential for LLM post-training in real-world settings.
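To make the asymmetric tail shaping concrete, here is a hedged sketch under assumed design choices (quantile parameterization, the median as reference point, a quadratic penalty): spread below the median is penalized, contracting the lower tail, while the upper tail is left unpenalized so exploratory spread survives. This illustrates the stated idea; it is not the paper's loss.

```python
# Hedged sketch of an asymmetric tail regularizer: contract the lower tail
# (penalize spread below the median) while leaving the upper tail free.
# Reference point, penalty shape, and weight `lam` are assumptions.
import torch

def asymmetric_tail_reg(quantiles: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    q_sorted, _ = torch.sort(quantiles, dim=-1)
    median = q_sorted.median(dim=-1, keepdim=True).values
    below = (median - q_sorted).clamp(min=0.0)   # lower-tail deviations only
    return lam * below.pow(2).mean()             # upper tail is unpenalized
```

The abstract's "expanding the upper tail" would add a bonus on upper-tail spread; it is omitted here because an unbounded expansion term would make this sketch's objective ill-posed, and the paper's bounding scheme is not specified in this summary.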
Problem

Research questions and friction points this paper is trying to address.

Address noisy supervision in LLM post-training
Balance robustness and generalization in RL
Improve policy performance across diverse scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributional value modeling for token-level supervision (sketched after this list)
Asymmetric risk regularization on distribution tails
Conditional risk theory for robustness-generalization balance
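As a sketch of the first item above, a token-level distributional critic can replace the usual scalar value head with one that emits several quantile estimates per token; the class name, shapes, and quantile count below are assumptions, not the paper's architecture.

```python
# Illustrative per-token distributional value head: N quantile estimates
# of the return instead of one scalar V(s_t). Names and shapes are assumed.
import torch
import torch.nn as nn

class QuantileValueHead(nn.Module):
    def __init__(self, hidden_size: int, num_quantiles: int = 32):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_quantiles)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the LM trunk
        # output:        (batch, seq_len, num_quantiles) per-token quantiles
        return self.proj(hidden_states)
```

A head like this is typically trained with the quantile-regression (pinball) loss against return targets; a tail regularizer such as the one sketched earlier would then be applied on top of its outputs.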
👥 Authors

Dingwei Zhu · College of Computer Science and Artificial Intelligence, Fudan University
Zhiheng Xi · Fudan University · LLM Reasoning, LLM-based Agents
Shihan Dou · Fudan University · LLMs, Code LMs, RL, Alignment
Yuhui Wang · College of Computer Science and Artificial Intelligence, Fudan University
Sixian Li · Master's student, Fudan University · NLP
Junjie Ye · College of Computer Science and Artificial Intelligence, Fudan University
Honglin Guo · Fudan University · Large Language Models
Shichun Liu · Fudan University · NLP
Chenhao Huang · School of Computer Science, University of Sydney · Distributed Data Management, Distributed Systems
Yajie Yang · College of Computer Science and Artificial Intelligence, Fudan University
Junlin Shang · College of Computer Science and Artificial Intelligence, Fudan University
Senjie Jin · Fudan University · Natural Language Processing
Ming Zhang · College of Computer Science and Artificial Intelligence, Fudan University
Jiazheng Zhang · Fudan University · Large Language Models, Natural Language Processing, Data Mining
Caishuang Huang · Fudan University · LLMs, RLHF, Tool Learning
Yunke Zhang · Honor Device Co., Ltd
Demei Yan · Honor Device Co., Ltd
Yuran Wang · Honor Device Co., Ltd
Tao Gui · College of Computer Science and Artificial Intelligence, Fudan University