VRPO: Rethinking Value Modeling for Robust RL Training under Noisy Supervision

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
In real-world RLHF, noisy human feedback degrades policy stability and generalization, in particular distorting advantage estimation by corrupting semantically critical signals. To address this, we propose a value-model-centric robust training framework that, uniquely, leverages the value model as an active agent of noise suppression. The method introduces an auxiliary loss grounded in language-model entropy and perplexity, coupled with a variational information bottleneck that selectively encodes discriminative semantic features. Integrated into the PPO framework, it uses a frozen language model to provide reliable uncertainty estimates, thereby improving value function estimation. Evaluated on mathematical reasoning, scientific question answering, and multi-turn dialogue tasks, the approach significantly outperforms PPO and GRPO baselines under both rule-injected and model-generated noisy reward settings. The results validate value-model-driven denoising as an effective and generalizable paradigm for robust RLHF.
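To make the mechanism concrete, here is a minimal sketch of how entropy and perplexity from a frozen language model could regularize a value model. It assumes a HuggingFace-style `frozen_lm` that returns `.logits`; the function names, the confidence weighting, and the loss form are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def frozen_lm_signals(frozen_lm, input_ids, attention_mask):
    """Per-token entropy and next-token perplexity from a frozen LM.

    High-entropy / high-perplexity positions are treated as uncertain,
    and hence more likely to carry noisy supervision.
    """
    logits = frozen_lm(input_ids=input_ids, attention_mask=attention_mask).logits
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # (batch, seq)
    nll = F.cross_entropy(
        logits[:, :-1].transpose(1, 2),  # (batch, vocab, seq-1)
        input_ids[:, 1:],                # next-token targets
        reduction="none",
    )
    perplexity = nll.exp()                                  # (batch, seq-1)
    return entropy, perplexity

def value_aux_loss(values, returns, entropy, weight=0.1):
    """Hypothetical auxiliary term: down-weight the value-regression
    error on high-entropy (uncertain) tokens."""
    confidence = torch.exp(-entropy)  # in (0, 1]; small where the LM is unsure
    return weight * (confidence * (values - returns) ** 2).mean()
```

The intuition behind such a term: where the frozen model is confident, the value target is likely clean and the regression is trusted; where it is uncertain, the contribution to the value loss shrinks, so noisy positions pull the value model around less.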

📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) often suffers from noisy or imperfect reward supervision in real-world settings, which undermines policy stability and generalization. Such noise may cause models to lose attention on key words during advantage estimation. While prior work focuses on reward denoising or filtering poor data, it often overlooks the critical role of the value model in policy optimization. In this work, we show that a strong value model is essential for mitigating noise by absorbing unstable signals and enabling more reliable advantage estimation. We propose VRPO, a value-centric framework for robust PPO training under noisy supervision. VRPO combines two core designs: (1) an auxiliary loss guided by entropy and perplexity from a frozen language model, and (2) a variational information bottleneck. These mechanisms enhance the value model's ability to filter out noise and capture key words from the context during advantage estimation, transforming it from a passive predictor into an active regulator of noise. Experiments on math reasoning, science QA, and multi-turn dialogue, under both rule-based and model-based noisy rewards, show that VRPO consistently outperforms PPO and GRPO baselines. Our findings underscore the often-overlooked importance of the value model in RLHF and offer a principled and practical approach to robust policy optimization in noisy real-world environments.
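Since the abstract centers on advantage estimation, it helps to recall how a standard PPO pipeline computes advantages with Generalized Advantage Estimation (GAE). This is the textbook recurrence, not VRPO-specific code; it shows why a noisy reward at step t contaminates every advantage from step 0 through t, and hence why a value model that absorbs noise matters.

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Standard GAE: A_t = delta_t + gamma * lam * A_{t+1},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    A corrupted r_t perturbs delta_t, which then propagates backward
    into A_{t-1}, A_{t-2}, ..., A_0 through the recurrence.
    """
    T = rewards.size(0)
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0  # terminal V = 0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```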
Problem

Research questions and friction points this paper is trying to address.

Mitigating noisy reward supervision in RLHF training
Enhancing value model robustness for reliable advantage estimation
Improving policy stability and generalization under noisy environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Value-centric framework for robust PPO training
Auxiliary loss with entropy and perplexity guidance
Variational information bottleneck for noise filtering (see the sketch after this list)
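As a rough illustration of the third design, below is a minimal variational-information-bottleneck value head: the hidden state is compressed into a stochastic latent whose KL penalty toward a standard normal prior discourages encoding spurious, noise-correlated features. The class name, layer sizes, and single-linear value head are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class VIBValueHead(nn.Module):
    """Hypothetical value head with a variational information bottleneck.

    The hidden state h is mapped to a Gaussian q(z|h); sampling uses the
    reparameterization trick, and KL(q(z|h) || N(0, I)) penalizes latents
    that encode more information than the value prediction needs.
    """
    def __init__(self, hidden_size=4096, latent_size=256):
        super().__init__()
        self.to_mu = nn.Linear(hidden_size, latent_size)
        self.to_logvar = nn.Linear(hidden_size, latent_size)
        self.value = nn.Linear(latent_size, 1)

    def forward(self, h):
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        # Closed-form KL(N(mu, sigma^2) || N(0, I)), averaged over the batch
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return self.value(z).squeeze(-1), kl
```

In a PPO loop, the KL term would presumably be added to the value loss with a small coefficient, trading predictive fidelity against compression of noise-correlated features.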
👥 Authors
Dingwei Zhu
College of Computer Science and Artificial Intelligence, Fudan University
Shihan Dou
Fudan University
LLMs, Code LMs, RL, Alignment
Zhiheng Xi
Fudan University
LLM Reasoning, LLM-based Agents
Senjie Jin
Fudan University
Natural Language Processing
Guoqiang Zhang
College of Computer Science and Artificial Intelligence, Fudan University
Jiazheng Zhang
Fudan University
Large Language Model, Natural Language Processing, Data Mining
Junjie Ye
College of Computer Science and Artificial Intelligence, Fudan University
Mingxu Chai
Fudan University
Enyu Zhou
College of Computer Science and Artificial Intelligence, Fudan University
Ming Zhang
College of Computer Science and Artificial Intelligence, Fudan University
Caishuang Huang
Fudan University
LLM, RLHF, Tool Learning
Yunke Zhang
Honor Device Co., Ltd
Yuran Wang
Honor Device Co., Ltd
Tao Gui
College of Computer Science and Artificial Intelligence, Fudan University