A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of simultaneously accommodating diverse human preferences across multiple demographic groups and ensuring fairness in federated learning (FL) for large language models (LLMs), this paper proposes a decentralized preference alignment framework. Methodologically, it integrates reinforcement learning from human feedback (RLHF) based on proximal policy optimization (PPO) into the FL paradigm and introduces an adaptive preference weighting mechanism: client-specific reward aggregation weights are dynamically adjusted according to each client's historical alignment performance. Four aggregation strategies are evaluated systematically: min, max, mean, and adaptive. The key contribution is the first FL-RLHF evaluation framework that jointly optimizes alignment quality and group fairness. Empirical validation on question-answering tasks shows that adaptive aggregation preserves overall performance while significantly improving both the representation of minority-group preferences and cross-group fairness, outperforming conventional static aggregation methods.
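The four server-side aggregation strategies can be sketched as below. This is a minimal illustration, not the authors' implementation: the softmax-over-negative-history weighting and the `temperature` parameter are assumptions standing in for the paper's unspecified "historical alignment performance" update.

```python
import math

def aggregate_rewards(group_rewards, strategy="mean", history=None, temperature=1.0):
    """Server-side aggregation of per-group reward signals.

    group_rewards : reward each group's local evaluator assigns to a rollout.
    history       : one scalar per group summarizing its historical alignment
                    performance (hypothetical proxy; the paper does not fix
                    the exact statistic). Lower values mean the group has been
                    served poorly so far.
    """
    if strategy == "min":        # worst-case (max-min fairness)
        return min(group_rewards)
    if strategy == "max":        # best-case
        return max(group_rewards)
    if strategy == "mean":       # utilitarian average
        return sum(group_rewards) / len(group_rewards)
    if strategy == "adaptive":
        # Softmax over *negative* historical performance: a group whose
        # preferences have been matched poorly so far receives a larger weight.
        exps = [math.exp(-h / temperature) for h in history]
        z = sum(exps)
        return sum((e / z) * r for e, r in zip(exps, group_rewards))
    raise ValueError(f"unknown strategy: {strategy!r}")
```

With rewards `[0.2, 0.8, 0.5]` and histories `[0.9, 0.1, 0.5]`, the adaptive scheme up-weights the second group (weakest history), pulling the aggregate above the plain mean; the min/max/mean variants ignore history entirely.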

📝 Abstract
This paper addresses the challenge of aligning large language models (LLMs) with diverse human preferences within federated learning (FL) environments, where standard methods often fail to adequately represent diverse viewpoints. We introduce a comprehensive evaluation framework that systematically assesses the trade-off between alignment quality and fairness when using different aggregation strategies for human preferences. In our federated setting, each group locally evaluates rollouts and produces reward signals, and the server aggregates these group-level rewards without accessing any raw data. Specifically, we evaluate standard reward aggregation techniques (min, max, and average) and introduce a novel adaptive scheme that dynamically adjusts preference weights based on a group's historical alignment performance. Our experiments on question-answering (Q/A) tasks using a PPO-based RLHF pipeline demonstrate that our adaptive approach consistently achieves superior fairness while maintaining competitive alignment scores. This work offers a robust methodology for evaluating LLM behavior across diverse populations and provides a practical solution for developing truly pluralistic and fairly aligned models.
Problem

Research questions and friction points this paper is trying to address.

Evaluates preference aggregation in federated RLHF for pluralistic LLM alignment
Assesses trade-offs between alignment quality and fairness across diverse groups
Introduces adaptive aggregation to improve fairness while maintaining competitive alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive preference aggregation adjusts weights dynamically
Evaluates reward aggregation techniques in federated RLHF
Ensures fairness while maintaining competitive alignment scores
Mahmoud Srewa
Department of Electrical Engineering and Computer Science, University of California, Irvine, Irvine, CA 92697, USA
Tianyu Zhao
Department of Electrical Engineering and Computer Science, University of California, Irvine, Irvine, CA 92697, USA
Salma Elmalaki
EECS Department at University of California, Irvine
Human Factors · CPS · Mobile Computing · Extended Reality