APPA: Adaptive Preference Pluralistic Alignment for Fair Federated RLHF of LLMs

📅 2026-04-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of fairly aligning diverse human values across multiple user groups in federated reinforcement learning from human feedback (FedRLHF) without centralizing preference data. The authors propose the Adaptive Preference Pluralistic Alignment (APPA) framework, which introduces a dynamic reward-reweighting mechanism that operates without access to raw preference data. Within a PPO-based federated RLHF pipeline, APPA adaptively prioritizes under-aligned groups by leveraging historical alignment performance, while preserving the performance of already well-aligned groups. This approach overcomes the limitations of both average-based and worst-case (min-aggregation) strategies, achieving up to a 28% improvement in worst-group alignment on the GLOBALQA and OQA benchmarks while outperforming min-aggregation baselines in overall alignment in most experimental settings.
📝 Abstract
Aligning large language models (LLMs) with diverse human preferences requires pluralistic alignment, where a single model must respect the values of multiple distinct groups simultaneously. In federated reinforcement learning from human feedback (FedRLHF), these groups align a shared policy without centralizing preference data, which makes fair reward aggregation essential. Existing aggregation methods exhibit clear trade-offs: average-based aggregation systematically under-aligns worst-performing groups, while min-aggregation prioritizes worst-group performance at the cost of overall alignment. We propose APPA, an Adaptive Preference Pluralistic Alignment framework that dynamically reweights group-level rewards based on historical alignment rewards. Our approach prioritizes under-aligned groups without degrading well-aligned ones, while requiring no access to raw preference data. Integrated into a proximal policy optimization (PPO)-based FedRLHF pipeline and evaluated on GLOBALQA and OQA across three model families (Gemma 2 2B, Llama 3.2 3B, Qwen3 0.6B), APPA achieves strong fairness-alignment trade-offs, improving worst-group alignment by up to 28% over average aggregation while maintaining higher overall alignment than min-aggregation across most configurations.
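The reweighting idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact rule: it assumes the server keeps a per-group running estimate of historical alignment reward and assigns weights via a softmax over the *negative* mean-centered rewards, so under-aligned groups receive larger weight in the aggregate reward fed to PPO. The function names, the softmax form, and the `temperature` parameter are all assumptions for this sketch.

```python
import math

def appa_weights(historical_rewards, temperature=1.0):
    # Hypothetical APPA-style weighting (the paper's formula may differ):
    # softmax over negative mean-centered historical rewards, so groups
    # with lower historical alignment get larger weights.
    mean = sum(historical_rewards) / len(historical_rewards)
    logits = [-(r - mean) / temperature for r in historical_rewards]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_reward(group_rewards, weights):
    # Weighted aggregate used in place of a plain average or a min,
    # fed to the shared PPO policy update.
    return sum(w * r for w, r in zip(weights, group_rewards))

hist = [0.8, 0.75, 0.3]   # per-group historical alignment rewards
w = appa_weights(hist)    # group 2 (lowest reward) gets the largest weight
```

Unlike min-aggregation, every group keeps nonzero weight, so well-aligned groups are not abandoned; unlike averaging, the weights shift toward whichever group is currently worst aligned. The `temperature` knob would interpolate between the two extremes.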
Problem

Research questions and friction points this paper is trying to address.

federated RLHF
preference pluralism
fair alignment
reward aggregation
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Reweighting
Pluralistic Alignment
Fair Federated RLHF
Preference Aggregation
LLM Alignment