Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma

📅 2025-11-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper identifies and formalizes a fundamental “alignment trilemma” in Reinforcement Learning from Human Feedback (RLHF): no RLHF system can simultaneously satisfy (i) ε-representativeness over diverse human values, (ii) polynomial sample and computational complexity, and (iii) δ-robustness against distribution shift and adversarial perturbations. Through a complexity-theoretic analysis that combines statistical learning theory and robust optimization, the authors establish the intrinsic three-way tension among representativeness, tractability, and robustness. They prove that global representativeness requires on the order of 10⁸ preference samples, and that jointly achieving representativeness and robustness requires Ω(2ᵈ) operations in the context dimension d, which is super-polynomial. The framework gives a unified explanation for documented empirical pathologies, including preference collapse, sycophancy, and bias amplification, and yields theoretical bounds and principled design guidelines for safe, fair, and robust alignment.
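The orders of magnitude quoted above can be made concrete with a small back-of-envelope sketch. The constants below are the figures stated in the summary (10³–10⁴ collected vs. 10⁷–10⁸ needed, and an Ω(2ᵈ) compute lower bound), not values derived from the paper's proofs:

```python
# Illustrative arithmetic for the summary's claimed bounds; the constants
# are orders of magnitude quoted in the text, not computed from the proofs.

def exact_alignment_ops(d_context: int) -> int:
    """Lower bound Omega(2^d): operations the paper claims are necessary
    for joint representativeness (eps <= 0.01) and robustness (delta <= 0.001)."""
    return 2 ** d_context

# Sample-complexity gap: what current pipelines collect vs. what the
# paper argues true global representativeness requires.
collected = 10 ** 4        # upper end of typical RLHF preference datasets
needed = 10 ** 8           # claimed requirement for global representation
gap = needed // collected  # factor by which current practice falls short

print(f"sample gap: {gap:,}x")  # prints "sample gap: 10,000x"
for d in (16, 32, 64):
    print(f"d={d}: >= {exact_alignment_ops(d):,} ops")
```

Even at a modest context dimension of d = 64, the claimed bound already exceeds 10¹⁹ operations, which is the sense in which the paper calls exact alignment intractable at global scale.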

📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) is widely used for aligning large language models, yet practitioners face a persistent puzzle: improving safety often reduces fairness, scaling to diverse populations becomes computationally intractable, and making systems robust often amplifies majority biases. We formalize this tension as the Alignment Trilemma: no RLHF system can simultaneously achieve (i) ε-representativeness across diverse human values, (ii) polynomial tractability in sample and compute complexity, and (iii) δ-robustness against adversarial perturbations and distribution shift. Through a complexity-theoretic analysis integrating statistical learning theory and robust optimization, we prove that achieving both representativeness (ε ≤ 0.01) and robustness (δ ≤ 0.001) for global-scale populations requires Ω(2^{d_context}) operations, which is super-polynomial in the context dimensionality. We show that current RLHF implementations resolve this trilemma by sacrificing representativeness: they collect only 10³–10⁴ samples from homogeneous annotator pools while 10⁷–10⁸ samples are needed for true global representation. Our framework provides a unified explanation for documented RLHF pathologies including preference collapse, sycophancy, and systematic bias amplification. We conclude with concrete directions for navigating these fundamental trade-offs through strategic relaxations of alignment requirements.
Problem

Research questions and friction points this paper is trying to address.

Formalizing the RLHF trilemma where safety, fairness, and scalability conflict
Proving perfect AI alignment requires super-polynomial operations for global populations
Explaining how current RLHF implementations sacrifice representativeness to remain tractable
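The trilemma framed by these questions can be read as a simple mutual-exclusion constraint: any two of the three properties may hold, but never all three. The predicate below is an illustrative reading of that claim, not the paper's formalism:

```python
from itertools import product

def trilemma_holds(representative: bool, tractable: bool, robust: bool) -> bool:
    """The paper's thesis read as a constraint: an RLHF system may satisfy
    any two of {representativeness, tractability, robustness}, but not all three."""
    return not (representative and tractable and robust)

# Enumerate all 2^3 property combinations; only the all-three case is ruled out.
feasible = [combo for combo in product([False, True], repeat=3)
            if trilemma_holds(*combo)]
print(len(feasible))  # prints 7
```

Under this reading, current RLHF practice occupies the (tractable, robust) corner, giving up representativeness, which is exactly the resolution the paper attributes to deployed systems.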
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formalizes RLHF trilemma via complexity theory
Proves representativeness requires super-polynomial operations
Shows current RLHF sacrifices representativeness for tractability