🤖 AI Summary
To address the challenge of simultaneously accommodating diverse human preferences across multiple demographic groups and ensuring fairness in federated learning (FL) for large language models (LLMs), this paper proposes a decentralized preference alignment framework. Methodologically, it integrates reinforcement learning from human feedback (RLHF) based on proximal policy optimization (PPO) into the FL paradigm and introduces an adaptive preference weighting mechanism: client-specific reward aggregation weights are dynamically adjusted according to each client’s historical alignment performance. The paper systematically evaluates four aggregation strategies: min, max, mean, and adaptive. Its key contribution is the first FL-RLHF evaluation framework that jointly assesses alignment quality and group fairness. Empirical validation on question-answering tasks demonstrates that adaptive aggregation preserves overall performance while significantly improving representativeness of minority-group preferences and cross-group fairness, outperforming conventional static aggregation methods.
📝 Abstract
This paper addresses the challenge of aligning large language models (LLMs) with diverse human preferences within federated learning (FL) environments, where standard methods often fail to adequately represent diverse viewpoints. We introduce a comprehensive evaluation framework that systematically assesses the trade-off between alignment quality and fairness when using different aggregation strategies for human preferences. In our federated setting, each group locally evaluates rollouts and produces reward signals, and the server aggregates these group-level rewards without accessing any raw data. Specifically, we evaluate standard reward aggregation techniques (min, max, and average) and introduce a novel adaptive scheme that dynamically adjusts preference weights based on a group's historical alignment performance. Our experiments on question-answering (Q/A) tasks using a PPO-based RLHF pipeline demonstrate that our adaptive approach consistently achieves superior fairness while maintaining competitive alignment scores. This work offers a robust methodology for evaluating LLM behavior across diverse populations and provides a practical solution for developing truly pluralistic and fairly aligned models.
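The aggregation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `temperature` parameter, and in particular the softmax rule that gives larger weights to groups with lower historical rewards are assumptions made for illustration, since the abstract does not specify the adaptive update rule.

```python
import math
from typing import Dict, List, Optional


def adaptive_weights(history: Dict[str, List[float]],
                     temperature: float = 1.0) -> Dict[str, float]:
    """Turn each group's past rewards into normalized aggregation weights.

    Assumed rule (not taken from the paper): a softmax over the negated
    historical mean reward, so groups that have been aligned less well
    so far receive more weight. Each history list must be non-empty.
    """
    means = {g: sum(r) / len(r) for g, r in history.items()}
    logits = {g: -m / temperature for g, m in means.items()}
    shift = max(logits.values())  # subtract the max for numerical stability
    exps = {g: math.exp(v - shift) for g, v in logits.items()}
    total = sum(exps.values())
    return {g: e / total for g, e in exps.items()}


def aggregate_rewards(group_rewards: Dict[str, float],
                      strategy: str = "mean",
                      weights: Optional[Dict[str, float]] = None) -> float:
    """Combine per-group rewards for one rollout into a single scalar.

    Only group-level rewards reach this function, mirroring the setting
    in which the server never sees raw client data.
    """
    values = list(group_rewards.values())
    if strategy == "min":
        return min(values)
    if strategy == "max":
        return max(values)
    if strategy == "mean":
        return sum(values) / len(values)
    if strategy == "adaptive":
        if weights is None:
            raise ValueError("adaptive aggregation requires weights")
        return sum(weights[g] * r for g, r in group_rewards.items())
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical example: group_a has been served worse than group_b so far.
history = {"group_a": [0.2, 0.2], "group_b": [0.8, 0.8]}
rewards = {"group_a": 0.2, "group_b": 0.8}
weights = adaptive_weights(history)
scalar_reward = aggregate_rewards(rewards, "adaptive", weights)
```

Under this assumed rule, the scalar reward passed to the PPO update is pulled toward the reward of the group with the weaker history, which sits between the `min` strategy (only the worst-off group counts) and the `mean` strategy (all groups count equally).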