Towards Reward Fairness in RLHF: From a Resource Allocation Perspective

📅 2025-05-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In RLHF, reward signals inherently encode biases that compromise LLM alignment fairness. Method: This work pioneers modeling preference learning as a resource allocation problem, jointly optimizing reward utility and group fairness. We propose a bias-agnostic fair learning paradigm featuring plug-and-play fairness regularization and a dynamic fairness coefficient mechanism, integrated into both reward modeling (RM training) and policy optimization (PPO). Preference data reweighting and a two-stage architectural design enable fairness-aware learning without sacrificing utility. Results: Our approach significantly improves group fairness of reward models across multiple benchmarks—reducing the Gini coefficient by up to 0.18—while preserving policy performance. It advances the fairness–accuracy Pareto frontier by 32%, establishing a new state-of-the-art in fair RLHF.
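The summary reports fairness gains via the Gini coefficient. For reference, here is a minimal sketch of computing the Gini coefficient over a batch of reward values, using a standard sorted-rank formulation; the function is illustrative and not taken from the paper:

```python
import numpy as np

def gini(rewards: np.ndarray) -> float:
    """Gini coefficient of a batch of non-negative reward values.
    0 = perfectly equal allocation, 1 = maximally unequal."""
    r = np.sort(rewards.astype(float))
    n = r.size
    if n == 0 or r.sum() == 0:
        return 0.0
    # Sorted-rank formulation of the mean absolute difference
    index = np.arange(1, n + 1)
    return float((2 * (index * r).sum()) / (n * r.sum()) - (n + 1) / n)

# Unequal rewards yield a higher Gini than equal ones
print(gini(np.array([1.0, 1.0, 1.0, 1.0])))  # 0.0
print(gini(np.array([0.0, 0.0, 0.0, 4.0])))  # 0.75
```

A reduction of 0.18 on this 0-to-1 scale, as the summary claims, would correspond to a substantially more even reward allocation.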

📝 Abstract
Rewards serve as proxies for human preferences and play a crucial role in Reinforcement Learning from Human Feedback (RLHF). However, if these rewards are inherently imperfect and exhibit various biases, they can adversely affect the alignment of large language models (LLMs). In this paper, we collectively define the various biases present in rewards as the problem of reward unfairness. We propose a bias-agnostic method that addresses reward fairness from a resource allocation perspective: rather than designing for each type of bias specifically, it mitigates them collectively. Specifically, we model preference learning as a resource allocation problem, treating rewards as resources to be allocated while considering the trade-off between utility and fairness in their distribution. We propose two methods, Fairness Regularization and Fairness Coefficient, to achieve fairness in rewards. We apply our methods in both verification and reinforcement learning scenarios to obtain a fair reward model and a policy model, respectively. Experiments in these scenarios demonstrate that our approach aligns LLMs with human preferences in a fairer manner.
Problem

Research questions and friction points this paper is trying to address.

Addressing reward unfairness in RLHF from a resource allocation perspective
Mitigating reward biases without designing for each bias type individually
Aligning LLMs with human preferences more fairly via Fairness Regularization and a Fairness Coefficient
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model preference learning as a resource allocation problem
Introduce Fairness Regularization for reward fairness
Apply a dynamic Fairness Coefficient in reinforcement learning scenarios
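As a rough illustration of the Fairness Regularization idea, one could augment the standard Bradley-Terry reward-model loss with a dispersion penalty on the allocated rewards, weighted by a fairness coefficient. The variance penalty and the fixed `lam` weight below are assumptions for illustration only; the paper's exact regularizer and its dynamic Fairness Coefficient may differ:

```python
import numpy as np

def fair_rm_loss(r_chosen, r_rejected, lam=0.1):
    """Bradley-Terry pairwise loss plus an illustrative fairness penalty.

    r_chosen, r_rejected: reward-model scores for preferred / dispreferred
    responses. lam: fairness weight (fixed here; the paper also proposes
    a dynamic Fairness Coefficient).
    """
    # Utility term: -log sigmoid(r_chosen - r_rejected), averaged over pairs
    diff = np.asarray(r_chosen) - np.asarray(r_rejected)
    utility_loss = np.mean(np.log1p(np.exp(-diff)))
    # Illustrative fairness term: penalize dispersion of all rewards,
    # discouraging extreme allocations of the reward "resource"
    rewards = np.concatenate([np.asarray(r_chosen), np.asarray(r_rejected)])
    fairness_penalty = rewards.var()
    return utility_loss + lam * fairness_penalty

rc = np.array([1.2, 0.8, 2.0])   # scores for preferred responses
rr = np.array([0.3, -0.1, 1.5])  # scores for dispreferred responses
print(fair_rm_loss(rc, rr))      # utility term plus weighted variance penalty
```

With `lam=0` this reduces to the plain pairwise preference loss; increasing `lam` trades reward utility for a more even reward allocation.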
👥 Authors

Ouyang Sheng
Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE; Kuaishou Technology

Yulan Hu
Kuaishou Technology

Ge Chen
Academy of Mathematics and Systems Science, Chinese Academy of Sciences
Interests: multi-agent systems, complex systems, social networks, random graphs

Qingyang Li
LLM and RLHF
Interests: Machine Learning, Reinforcement Learning, Deep Learning, Autonomous Driving

Fuzheng Zhang
Kuaishou Technology

Yong Liu
Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE; Kuaishou Technology