On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization

📅 2024-05-26
🏛️ arXiv.org
📈 Citations: 26
Influential: 0
🤖 AI Summary
This work identifies an inherent algorithmic bias in RLHF caused by its KL-divergence-based regularization, leading to "preference collapse": the systematic neglect of minority preferences. To address this, the authors propose the preference matching (PM) RLHF framework: (1) they give the first formal definition of preference collapse; (2) they derive a novel PM regularizer by solving an ordinary differential equation (ODE), with theoretical guarantees under the Bradley–Terry–Luce (BTL) and Plackett–Luce (PL) models that the LLM's policy provably matches the reward model's preference distribution; and (3) they design a conditional PM RLHF variant tailored to natural language generation. Experiments on OPT-1.3B and Llama-2-7B show that PM RLHF improves alignment with human preferences by 29%–41% over standard RLHF and substantially mitigates the suppression of minority preferences.

📝 Abstract
Accurately aligning large language models (LLMs) with human preferences is crucial for informing fair, economically sound, and statistically efficient decision-making processes. However, we argue that reinforcement learning from human feedback (RLHF) -- the predominant approach for aligning LLMs with human preferences through a reward model -- suffers from an inherent algorithmic bias due to its Kullback--Leibler-based regularization in optimization. In extreme cases, this bias could lead to a phenomenon we term preference collapse, where minority preferences are virtually disregarded. To mitigate this algorithmic bias, we introduce preference matching (PM) RLHF, a novel approach that provably aligns LLMs with the preference distribution of the reward model under the Bradley--Terry--Luce/Plackett--Luce model. Central to our approach is a PM regularizer that takes the form of the negative logarithm of the LLM's policy probability distribution over responses, which helps the LLM balance response diversification and reward maximization. Notably, we obtain this regularizer by solving an ordinary differential equation that is necessary for the PM property. For practical implementation, we introduce a conditional variant of PM RLHF that is tailored to natural language generation. Finally, we empirically validate the effectiveness of conditional PM RLHF through experiments on the OPT-1.3B and Llama-2-7B models, demonstrating a 29% to 41% improvement in alignment with human preferences, as measured by a certain metric, compared to standard RLHF.
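The abstract's contrast between the two regularizers can be sketched using the standard closed-form optima from RLHF analysis (an illustrative sketch; the notation and the inverse-temperature β are generic, not the paper's exact statement):

```latex
% Standard KL-regularized RLHF objective and its well-known optimum,
% which tilts the reference policy by the exponentiated reward:
\max_{\pi}\; \mathbb{E}_{y\sim\pi}\!\left[r(x,y)\right]
  - \beta\,\mathrm{KL}\!\left(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\right)
\;\Longrightarrow\;
\pi^{*}(y\mid x)\propto \pi_{\mathrm{ref}}(y\mid x)\,\exp\!\left(r(x,y)/\beta\right).

% PM regularizer: the negative logarithm of the policy probability.
% The resulting entropy-regularized objective is maximized by the
% softmax of the reward, i.e. the BTL preference distribution that
% the reward model induces:
\max_{\pi}\; \mathbb{E}_{y\sim\pi}\!\left[r(x,y) - \log\pi(y\mid x)\right]
\;\Longrightarrow\;
\pi^{*}(y\mid x)=\frac{\exp\!\left(r(x,y)\right)}{\sum_{y'}\exp\!\left(r(x,y')\right)}.
```

In the KL-regularized form, small β pushes the optimum toward the single highest-reward response, which is the mechanism behind the preference collapse the paper describes.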
Problem

Research questions and friction points this paper is trying to address.

KL-based regularization in RLHF causes algorithmic bias and preference collapse
Proposes preference matching to align the policy with the reward model's preference distribution
Improves human preference alignment by 29%–41%
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the preference matching (PM) RLHF algorithm
Uses a negative log-policy PM regularizer to balance response diversification and reward maximization
Derives the regularizer by solving an ODE necessary for the preference matching property
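The collapse-versus-matching behavior above can be demonstrated numerically on a toy two-response example (a sketch with made-up rewards and β, not the paper's experimental setup):

```python
import numpy as np

# Two candidate responses whose rewards encode a 70% / 30% split in human
# preferences under the Bradley-Terry-Luce model: r(y) = log preference prob.
rewards = np.array([np.log(0.7), np.log(0.3)])

# Target: the BTL preference distribution softmax(r) = (0.7, 0.3).
preference = np.exp(rewards) / np.exp(rewards).sum()

# Standard KL-regularized RLHF optimum: pi(y) proportional to
# pi_ref(y) * exp(r(y) / beta), here with a uniform reference policy.
pi_ref = np.array([0.5, 0.5])

def kl_optimal(beta):
    unnorm = pi_ref * np.exp(rewards / beta)
    return unnorm / unnorm.sum()

# As beta shrinks, the KL-regularized policy piles all mass on the majority
# response: preference collapse.
print(kl_optimal(0.1))  # close to [1, 0]; the 30% minority is nearly ignored

# PM regularizer -log pi(y): maximizing E[r(y)] + E[-log pi(y)] (an
# entropy-regularized objective) is solved by pi(y) proportional to
# exp(r(y)), i.e. exactly the preference distribution.
pm_optimal = np.exp(rewards) / np.exp(rewards).sum()
print(np.allclose(pm_optimal, preference))  # True: 70/30 split preserved
```

The design point illustrated here: the KL regularizer anchors the policy to a reference model and sharpens with small β, while the PM regularizer anchors it to the reward model's own preference distribution, so minority-preferred responses retain their proportional probability mass.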