🤖 AI Summary
This work studies the statistical limits of aligning large language models (LLMs) with diverse, potentially conflicting human preferences, and in particular the inability of reward-based alignment methods (e.g., RLHF) to represent all such preferences. We prove that human preferences can be represented by a reward model if and only if the preferences among LLM-generated responses contain no Condorcet cycle, and that under a probabilistic preference model such cycles arise with probability converging to one exponentially fast, an impossibility result for reward-based alignment. Turning to the reward-free paradigm of Nash learning from human feedback (NLHF), we identify a necessary and sufficient condition for the aligned model to retain a mixed strategy rather than collapse to a single response: no single response is preferred over all others by a majority. We show that this condition holds with high probability, so minority preferences can be preserved without explicit regularization. We further design a novel, computationally efficient algorithm for computing Nash equilibria under NLHF. Experiments show that Llama-3.2-1B aligned with our algorithm achieves a 60.55% win rate against the base model.
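To make the Condorcet-cycle obstruction concrete, here is a minimal, self-contained sketch (our own illustration, not code from the paper): three preference rankings over hypothetical responses A, B, C whose pairwise majorities form a cycle, so no scalar reward can rank the responses consistently.

```python
# Illustration (not from the paper): a Condorcet cycle among three
# responses A, B, C, and a check that no reward ordering is consistent
# with the pairwise majority preferences.
from itertools import combinations, permutations

# Three hypothetical annotators' rankings (best to worst).
rankings = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]

def majority_prefers(x, y):
    """True if a strict majority of rankings place x before y."""
    votes = sum(r.index(x) < r.index(y) for r in rankings)
    return votes > len(rankings) / 2

for x, y in combinations("ABC", 2):
    winner, loser = (x, y) if majority_prefers(x, y) else (y, x)
    print(f"majority prefers {winner} over {loser}")
# -> A over B, B over C, C over A: a Condorcet cycle.

# Any reward model induces a strict ordering of responses; check that
# no ordering agrees with all three pairwise majorities.
consistent = [
    order for order in permutations("ABC")
    if all(majority_prefers(order[i], order[j])
           for i in range(3) for j in range(i + 1, 3))
]
print("reward orderings consistent with the majorities:", consistent)  # []
```

The empty result shows these majority preferences are not reward-representable; the paper proves this equivalence in general and shows such cycles are overwhelmingly likely under its probabilistic preference model.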
📝 Abstract
Aligning large language models (LLMs) with diverse human preferences is critical for ensuring fairness and informed outcomes when deploying these models for decision-making. In this paper, we seek to uncover fundamental statistical limits of aligning LLMs with human preferences, with a focus on the probabilistic representation of human preferences and the preservation of diverse preferences in aligned LLMs. We first show that human preferences can be represented by a reward model if and only if the preference among LLM-generated responses is free of any Condorcet cycle. Moreover, we prove that Condorcet cycles exist with probability converging to one exponentially fast under a probabilistic preference model, thereby demonstrating the impossibility of fully aligning human preferences using reward-based approaches such as reinforcement learning from human feedback. Next, we explore the conditions under which LLMs would employ mixed strategies -- meaning they do not collapse to a single response -- when aligned in the limit using a non-reward-based approach, such as Nash learning from human feedback (NLHF). We identify a necessary and sufficient condition for mixed strategies: the absence of a response that is preferred over all others by a majority. As a blessing, we prove that this condition holds with high probability under the probabilistic preference model, thereby highlighting the statistical possibility of preserving minority preferences without explicit regularization in aligning LLMs. Finally, we leverage insights from our statistical results to design a novel, computationally efficient algorithm for finding Nash equilibria in aligning LLMs with NLHF. Our experiments show that Llama-3.2-1B, aligned with our algorithm, achieves a win rate of 60.55% against the base model.
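To illustrate the object NLHF computes, the sketch below solves for the mixed-strategy Nash equilibrium of the two-player preference game on a toy example. This is a generic linear-programming illustration, not the paper's algorithm, and the preference matrix `P` is hypothetical: each player outputs a response, and the row player's payoff is the probability its response is preferred over the opponent's. Because `P` contains a Condorcet cycle, no response is majority-preferred over all others, and the equilibrium is genuinely mixed.

```python
# Illustration (assumed data, generic LP; not the paper's algorithm):
# compute a maximin (Nash) strategy of the constant-sum preference game.
import numpy as np
from scipy.optimize import linprog

# Hypothetical pairwise preference matrix: P[i, j] = Pr(response i is
# preferred over response j), with P[j, i] = 1 - P[i, j]. It encodes a
# Condorcet cycle: A beats B, B beats C, C beats A.
P = np.array([[0.5, 0.7, 0.2],
              [0.3, 0.5, 0.7],
              [0.8, 0.3, 0.5]])

def nash_mixed_strategy(P):
    """Maximin LP: maximize v subject to (P^T pi)_j >= v for every
    opponent response j, with pi a probability distribution."""
    n = P.shape[0]
    c = np.zeros(n + 1)
    c[-1] = -1.0                                 # variables [pi, v]; minimize -v
    A_ub = np.hstack([-P.T, np.ones((n, 1))])    # v - (P^T pi)_j <= 0
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])  # sum(pi) = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * n + [(None, None)])
    return res.x[:n], res.x[-1]

pi, value = nash_mixed_strategy(P)
print("equilibrium policy:", np.round(pi, 3))  # mixed: about [0.286, 0.429, 0.286]
print("game value:", round(value, 3))          # 0.5, by the constant-sum structure
```

Note that an explicit LP like this enumerates the entire response space, which is exponentially large for LLMs; the snippet only shows what a mixed Nash equilibrium of the preference game looks like, whereas the paper's contribution is an algorithm that computes such equilibria efficiently in the alignment setting.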