Fairness Aware Reward Optimization

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the systemic unfairness in reward models and aligned large language models (LLMs) arising from group biases in human preference data. To mitigate this, the authors propose Faro, a framework that integrates fairness constraints—such as demographic parity, equalized odds, or counterfactual fairness—directly into reward model training. Faro provides the first theoretical guarantees of fairness at the reward level and demonstrates that such fairness can be effectively propagated to downstream policies. The framework also establishes a non-empty Pareto frontier between accuracy and fairness. By combining KL-regularized fine-tuning with fairness-constrained optimization, Faro ensures that reward models simultaneously preserve ordinal and cardinal properties while satisfying fairness criteria. Experiments across multiple LLMs and benchmarks show that Faro significantly reduces bias and harmful outputs without compromising—and sometimes even improving—overall performance.

📝 Abstract
Demographic skews in human preference data propagate systematic unfairness through reward models into aligned LLMs. We introduce Fairness Aware Reward Optimization (Faro), an in-processing framework that trains reward models under demographic parity, equalized odds, or counterfactual fairness constraints. We provide the first theoretical analysis of reward-level fairness in LLM alignment, establishing: (i) provable fairness certificates for Faro-trained rewards with controllable slack; (ii) a formal characterization of the accuracy-fairness trade-off induced by KL-regularized fine-tuning, proving that fairness transfers from reward to policy; and (iii) the existence of a non-empty Pareto frontier. Unlike pre- and post-processing methods, Faro ensures reward models are simultaneously ordinal (ranking correctly), cardinal (calibrated), and fair. Across multiple LLMs and benchmarks, Faro significantly reduces bias and harmful generations while maintaining or improving model quality.
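The abstract describes training a reward model under a fairness constraint (e.g. demographic parity) alongside the usual pairwise preference objective. A minimal sketch of that idea, assuming a Bradley-Terry pairwise loss and a Lagrangian-style penalty for the demographic-parity gap (the function names and the multiplier `lam` are illustrative, not the paper's actual formulation or values):

```python
import numpy as np

def bt_loss(r_chosen, r_rejected):
    # Bradley-Terry pairwise preference loss: -log sigmoid(r_chosen - r_rejected),
    # written with logaddexp for numerical stability.
    return np.mean(np.logaddexp(0.0, -(r_chosen - r_rejected)))

def dp_gap(rewards, groups):
    # Demographic-parity gap: absolute difference in mean reward
    # between demographic group 0 and group 1.
    g = np.asarray(groups)
    return abs(rewards[g == 0].mean() - rewards[g == 1].mean())

def faro_objective(r_chosen, r_rejected, groups, lam=1.0):
    # Penalized surrogate for the constrained problem: preference accuracy
    # term plus a fairness penalty weighted by a multiplier `lam`
    # (a hypothetical knob standing in for the constraint's slack).
    all_r = np.concatenate([r_chosen, r_rejected])
    all_g = np.concatenate([groups, groups])
    return bt_loss(r_chosen, r_rejected) + lam * dp_gap(all_r, all_g)
```

Sweeping `lam` from 0 upward traces out the kind of accuracy-fairness Pareto frontier the paper proves is non-empty: at `lam=0` the objective reduces to pure preference fitting, while large `lam` forces group reward means together.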
Problem

Research questions and friction points this paper is trying to address.

fairness
reward models
demographic bias
LLM alignment
unfairness propagation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fairness Aware Reward Optimization
Reward Model Fairness
LLM Alignment
Demographic Parity
Accuracy-Fairness Trade-off