🤖 AI Summary
Problem: Existing RLHF methods employ a single reward model, overlooking individual heterogeneity in human preferences and thus failing to align adequately with minority groups. Method: We first establish a theoretical impossibility result: under diverse preferences, no single reward model can simultaneously satisfy basic fairness and consistency. To address this, we propose a Fair Alignment Framework grounded in egalitarian principles from social choice theory, formulating a distributionally robust MaxMin alignment objective. Our approach combines mixture modeling of preference distributions, an Expectation-Maximization algorithm, and general-utility reinforcement learning for optimization. Results: Experiments on GPT-2 and Tulu2-7B show an average win-rate improvement of more than 16% over conventional RLHF and a win-rate gain of more than 33% for minority groups, with no performance degradation for majority groups, indicating improved fairness and robustness.
📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by using a single reward model derived from preference data. However, such an approach overlooks the rich diversity of human preferences inherent in data collected from multiple users. In this work, we first derive an impossibility result for alignment with single-reward RLHF, thereby highlighting its insufficiency in representing diverse human preferences. To provide an equitable solution to the problem, we learn a mixture of preference distributions via an expectation-maximization algorithm and propose a MaxMin alignment objective for policy learning, inspired by the Egalitarian principle in social choice theory, to better represent diverse human preferences. We elucidate the connection of our proposed approach to distributionally robust optimization and general utility RL, thereby highlighting the generality and robustness of our proposed solution. We present comprehensive experimental results on small-scale (GPT-2) and large-scale (Tulu2-7B) language models and show the efficacy of the proposed approach in the presence of diversity among human preferences. Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms and improves the win-rate (accuracy) for minority groups by over 33% without compromising the performance of majority groups, showcasing the robustness and fairness of our approach. We remark that our findings are not limited to language models and also extend to reinforcement learning in general.
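
To make the contrast between a single averaged reward and the MaxMin objective concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes a small finite set of candidate policies and a hypothetical table of per-group expected utilities (the numbers, the group weights, and the function names are all invented for illustration). It omits the EM step that learns the mixture of preference distributions and any regularization used in actual policy optimization.

```python
import numpy as np

def average_choice(utilities: np.ndarray, weights: np.ndarray) -> int:
    """Index of the candidate with the highest population-weighted mean utility.

    This mimics, very loosely, optimizing against one aggregated reward.
    """
    return int(np.argmax(utilities @ weights))

def maxmin_choice(utilities: np.ndarray) -> int:
    """Index of the candidate whose worst-off group has the highest utility.

    This is the egalitarian (MaxMin) selection rule applied to a finite set.
    """
    return int(np.argmax(utilities.min(axis=1)))

# Hypothetical data: rows are candidate policies, columns are user groups
# (column 0 = majority group, column 1 = minority group).
utilities = np.array([
    [0.90, 0.10],  # serves the majority well, the minority poorly
    [0.70, 0.60],  # balanced across both groups
    [0.40, 0.65],  # favors the minority
])
weights = np.array([0.8, 0.2])  # assumed population shares of the two groups

avg_idx = average_choice(utilities, weights)
mm_idx = maxmin_choice(utilities)
print("weighted-average choice:", avg_idx)
print("MaxMin choice:", mm_idx)
```

In this toy setting the weighted-average rule selects the policy that serves the majority and largely ignores the minority, while the MaxMin rule selects the balanced policy. This is the qualitative behavior the abstract attributes to the MaxMin alignment objective, here shown only on made-up numbers.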