Theoretical Tensions in RLHF: Reconciling Empirical Success with Inconsistencies in Social Choice Theory

📅 2025-06-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the paradox that Reinforcement Learning from Human Feedback (RLHF) succeeds empirically despite violating foundational social choice axioms such as majority consistency, pairwise majority consistency, and Condorcet consistency. The authors show that, under mild and empirically plausible assumptions on the preference profile, RLHF does satisfy pairwise majority and Condorcet consistency, offering a theoretical explanation for its strong practical performance. They further show that a slight modification to the reward modeling objective can guarantee pairwise majority or Condorcet consistency even under general preference profiles. Going beyond classical axioms, the paper introduces three new alignment criteria that better reflect the goal of learning distributions over responses: preference matching, preference equivalence, and group preference matching. RLHF satisfies the first two but fails the third, and the paper discusses how future alignment methods might be designed to satisfy all three.
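To make the axioms concrete, here is a minimal sketch of pairwise majority and Condorcet checks over a toy preference profile. The profile format and function names are illustrative and not taken from the paper; a rule is Condorcet consistent if it selects such a winner whenever one exists.

```python
def pairwise_margin(profile, a, b):
    """Net number of rankings placing a above b.
    `profile` is a list of rankings, each a list of alternatives from best to worst."""
    return sum(1 if r.index(a) < r.index(b) else -1 for r in profile)

def condorcet_winner(profile, alternatives):
    """Return the alternative that beats every other one in a pairwise majority vote, if any."""
    for a in alternatives:
        if all(pairwise_margin(profile, a, b) > 0 for b in alternatives if b != a):
            return a
    return None

# Toy preference profile over three candidate responses.
profile = [["y1", "y2", "y3"],
           ["y1", "y3", "y2"],
           ["y2", "y1", "y3"]]
print(condorcet_winner(profile, ["y1", "y2", "y3"]))  # -> "y1"
```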

📝 Abstract
Despite its empirical success, Reinforcement Learning from Human Feedback (RLHF) has been shown to violate almost all the fundamental axioms in social choice theory -- such as majority consistency, pairwise majority consistency, and Condorcet consistency. This raises a foundational question: why does RLHF perform so well in practice if it fails these seemingly essential properties? In this paper, we resolve this paradox by showing that under mild and empirically plausible assumptions on the preference profile, RLHF does satisfy pairwise majority and Condorcet consistency. These assumptions are frequently satisfied in real-world alignment tasks, offering a theoretical explanation for RLHF's strong practical performance. Furthermore, we show that a slight modification to the reward modeling objective can ensure pairwise majority or Condorcet consistency even under general preference profiles, thereby improving the alignment process. Finally, we go beyond classical axioms in economic and social choice theory and introduce new alignment criteria -- preference matching, preference equivalence, and group preference matching -- that better reflect the goal of learning distributions over responses. We show that while RLHF satisfies the first two properties, it fails to satisfy the third. We conclude by discussing how future alignment methods may be designed to satisfy all three.
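For context, the sketch below shows the standard Bradley-Terry reward-modeling loss that RLHF commonly uses and that the paper proposes to slightly modify; the modification itself is not reproduced here, only the unmodified baseline, written as an assumption about the usual setup.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Standard RLHF reward-modeling objective: maximize the log-likelihood that the
    preferred response gets the higher reward under a Bradley-Terry model,
    i.e. minimize -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy rewards for a batch of three preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 0.8])
r_rejected = torch.tensor([0.4, 0.5, -0.1])
print(bradley_terry_loss(r_chosen, r_rejected))
```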
Problem

Research questions and friction points this paper is trying to address.

Explain why RLHF performs well in practice despite violating social choice axioms
Modify reward modeling to guarantee pairwise majority or Condorcet consistency under general preference profiles
Introduce new alignment criteria suited to learning distributions over responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

RLHF satisfies pairwise majority and Condorcet consistency under mild, empirically plausible assumptions
A slight modification to the reward modeling objective ensures consistency under general preference profiles
Three new alignment criteria (preference matching, preference equivalence, group preference matching); see the sketch below
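As referenced above, here is one plausible reading of the preference matching criterion: the aligned policy's distribution over candidate responses should match the annotators' preference distribution. The formal definitions of all three criteria are in the paper; this sketch is only illustrative.

```python
def preference_matching_gap(policy_probs: dict, preference_probs: dict) -> float:
    """Total-variation distance between the policy's distribution over candidate
    responses and the population's preference distribution; preference matching,
    as read here, asks this gap to be zero."""
    keys = set(policy_probs) | set(preference_probs)
    return 0.5 * sum(abs(policy_probs.get(k, 0.0) - preference_probs.get(k, 0.0)) for k in keys)

# Toy example: 70% of annotators prefer y1, 30% prefer y2.
preference = {"y1": 0.7, "y2": 0.3}
matched_policy = {"y1": 0.7, "y2": 0.3}
collapsed_policy = {"y1": 1.0, "y2": 0.0}  # a policy concentrated entirely on the majority response
print(preference_matching_gap(matched_policy, preference))    # 0.0 -> satisfies the criterion
print(preference_matching_gap(collapsed_policy, preference))  # 0.3 -> fails it
```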