Jackpot! Alignment as a Maximal Lottery

๐Ÿ“… 2025-01-31
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing RLHF methods for aligning large language models (LLMs) with human values often neglect majority preferences, struggle with non-transitive preferences, and violate the independence of irrelevant alternatives (IIA). This work proposes replacing the RLHF objective with *maximal lotteries*, a probabilistic social-choice rule, and shows that Nash Learning from Human Feedback (NLHF) and its variants approximate maximal-lottery outcomes. Optimizing stochastic policies toward the maximal lottery therefore inherits its axiomatic guarantees: majority consistency, principled handling of non-transitive preferences, and IIA compliance. Empirical evaluation demonstrates that NLHF outperforms standard RLHF in respecting majority preferences, modeling non-transitive human judgments, and remaining stable when irrelevant alternatives are added. Consequently, NLHF yields responses better aligned with collective human intent.

๐Ÿ“ Abstract
Reinforcement Learning from Human Feedback (RLHF), the standard for aligning Large Language Models (LLMs) with human values, is known to fail to satisfy properties that are intuitively desirable, such as respecting the preferences of the majority (Ge et al., 2024). To overcome these issues, we propose the use of a probabilistic Social Choice rule called *maximal lotteries* as a replacement for RLHF. We show that a family of alignment techniques, namely Nash Learning from Human Feedback (NLHF) (Munos et al., 2023) and variants, approximate maximal lottery outcomes and thus inherit its beneficial properties. We confirm experimentally that our proposed methodology handles situations that arise when working with preferences more robustly than standard RLHF, including supporting the preferences of the majority, providing principled ways of handling non-transitivities in the preference data, and robustness to irrelevant alternatives. This results in systems that better incorporate human values and respect human intentions.
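The abstract's core object can be made concrete: a maximal lottery is an optimal mixed strategy of the symmetric zero-sum game whose payoff matrix holds the pairwise preference margins between alternatives. The sketch below is not from the paper; the multiplicative-weights self-play solver and all names in it are illustrative assumptions. It approximates the maximal lottery on a Condorcet cycle, the kind of non-transitive preference data that breaks a scalar RLHF reward model.

```python
import math

def maximal_lottery(margins, iters=20000):
    """Approximate the maximal lottery of a skew-symmetric margin matrix
    (margins[i][j] = P(i beats j) - P(j beats i)) via multiplicative-weights
    self-play. Returns the time-averaged mixture, an approximate maximin
    strategy of the symmetric zero-sum game."""
    n = len(margins)
    eta = math.sqrt(math.log(n) / iters)  # standard no-regret step size
    p = [1.0 / n] * n                     # current mixed strategy
    avg = [0.0] * n                       # running average of strategies
    for _ in range(iters):
        for i in range(n):
            avg[i] += p[i]
        # expected margin of each pure alternative against the current mixture
        payoff = [sum(margins[i][j] * p[j] for j in range(n)) for i in range(n)]
        # Hedge update, renormalized each step to stay numerically stable
        w = [p[i] * math.exp(eta * payoff[i]) for i in range(n)]
        total = sum(w)
        p = [x / total for x in w]
    return [a / iters for a in avg]

# Condorcet cycle: A beats B, B beats C, C beats A.
margins = [[0, 1, -1],
           [-1, 0, 1],
           [1, -1, 0]]
print(maximal_lottery(margins))  # approximately [1/3, 1/3, 1/3]
```

On this cycle no deterministic choice is majority-consistent, and the unique maximal lottery is the uniform distribution; the solver's output concentrates there, illustrating why optimizing over stochastic policies, as NLHF does, can satisfy axioms that deterministic RLHF winners cannot.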
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Alignment
Human Values
Innovation

Methods, ideas, or system contributions that make the work stand out.

Maximal Lotteries
Value Alignment
Human Preference Optimization
๐Ÿ”Ž Similar Papers
No similar papers found.