A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of simultaneously accommodating diverse human preferences across multiple demographic groups and ensuring fairness in federated learning (FL) for large language models (LLMs), this paper proposes a decentralized preference alignment framework. Methodologically, it integrates reinforcement learning from human feedback (RLHF) based on proximal policy optimization (PPO) into the FL paradigm and introduces an adaptive preference weighting mechanism: client-specific reward aggregation weights are dynamically adjusted according to each client's historical alignment performance. Four aggregation strategies are evaluated systematically: min, max, mean, and adaptive. The key contribution is the first FL-RLHF evaluation framework that jointly optimizes alignment quality and group fairness. Empirical validation on question-answering tasks shows that adaptive aggregation preserves overall performance while significantly improving both the representation of minority-group preferences and cross-group fairness, outperforming conventional static aggregation methods.
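The four server-side aggregation strategies can be sketched as below. This is a minimal illustration, not the authors' implementation: the softmax-over-negative-history weighting and the `temperature` parameter are assumptions standing in for the paper's unspecified "historical alignment performance" update.

```python
import math

def aggregate_rewards(group_rewards, strategy="mean", history=None, temperature=1.0):
    """Server-side aggregation of per-group reward signals.

    group_rewards : reward each group's local evaluator assigns to a rollout.
    history       : one scalar per group summarizing its historical alignment
                    performance (hypothetical proxy; the paper does not fix
                    the exact statistic). Lower values mean the group has been
                    served poorly so far.
    """
    if strategy == "min":        # worst-case (max-min fairness)
        return min(group_rewards)
    if strategy == "max":        # best-case
        return max(group_rewards)
    if strategy == "mean":       # utilitarian average
        return sum(group_rewards) / len(group_rewards)
    if strategy == "adaptive":
        # Softmax over *negative* historical performance: a group whose
        # preferences have been matched poorly so far receives a larger weight.
        exps = [math.exp(-h / temperature) for h in history]
        z = sum(exps)
        return sum((e / z) * r for e, r in zip(exps, group_rewards))
    raise ValueError(f"unknown strategy: {strategy!r}")
```

With rewards `[0.2, 0.8, 0.5]` and histories `[0.9, 0.1, 0.5]`, the adaptive scheme up-weights the second group (weakest history), pulling the aggregate above the plain mean; the min/max/mean variants ignore history entirely.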

📝 Abstract
This paper addresses the challenge of aligning large language models (LLMs) with diverse human preferences within federated learning (FL) environments, where standard methods often fail to adequately represent diverse viewpoints. We introduce a comprehensive evaluation framework that systematically assesses the trade-off between alignment quality and fairness when using different aggregation strategies for human preferences. In our federated setting, each group locally evaluates rollouts and produces reward signals, and the server aggregates these group-level rewards without accessing any raw data. Specifically, we evaluate standard reward aggregation techniques (min, max, and average) and introduce a novel adaptive scheme that dynamically adjusts preference weights based on a group's historical alignment performance. Our experiments on question-answering (Q/A) tasks using a PPO-based RLHF pipeline demonstrate that our adaptive approach consistently achieves superior fairness while maintaining competitive alignment scores. This work offers a robust methodology for evaluating LLM behavior across diverse populations and provides a practical solution for developing truly pluralistic and fairly aligned models.
Problem

Research questions and friction points this paper is trying to address.

Evaluates preference aggregation in federated RLHF for pluralistic LLM alignment
Assesses trade-offs between alignment quality and fairness across diverse groups
Introduces adaptive aggregation to improve fairness while maintaining competitive alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive preference aggregation adjusts weights dynamically
Evaluates reward aggregation techniques in federated RLHF
Ensures fairness while maintaining competitive alignment scores
Mahmoud Srewa
Department of Electrical Engineering and Computer Science, University of California, Irvine, Irvine, CA 92697, USA
Tianyu Zhao
Department of Electrical Engineering and Computer Science, University of California, Irvine, Irvine, CA 92697, USA
Salma Elmalaki
EECS Department at University of California, Irvine
Human Factors · CPS · Mobile Computing · Extended Reality