Theoretical Limits of Language Model Alignment

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work asks how much expected reward can be gained under a KL-divergence constraint in language model alignment, a limit that has been poorly understood and that existing algorithms such as PPO and GRPO fall well short of. Taking an information-theoretic perspective, the authors derive a closed-form expression for the optimal expected reward gain under this constraint and show that it is governed by the Jeffreys divergence rather than by the square root of the KL divergence. They also rewrite the expression as a covariance under the base model, which gives an estimator that needs only samples from the base model. On the theory side, they prove that reward ensembling mitigates reward hacking. Experiments on safety and summarization tasks trace out KL-reward Pareto frontiers and show that best-of-N sampling comes close to the theoretical optimum, whereas PPO and GRPO remain substantially suboptimal, in line with the theoretical predictions.
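To make the comparison between best-of-N and the KL-constrained optimum concrete, here is a minimal toy sketch. It is not the paper's experiment: the four-outcome base policy, the reward values, and the helper names (`best_of_n`, `tilted`, `match_beta`) are all hypothetical. It relies on two standard facts: with distinct rewards, best-of-N picks outcome i with probability F_i^N - F_{i-1}^N, where F is the base-policy CDF over outcomes sorted by reward; and at a given KL budget, no policy can beat the exponentially tilted policy π*(y) ∝ π₀(y)·exp(r(y)/β).

```python
import numpy as np

# Hypothetical toy problem: 4 possible outputs, rewards distinct and sorted ascending.
base = np.array([0.4, 0.3, 0.2, 0.1])
reward = np.array([0.0, 0.5, 1.0, 2.0])

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def best_of_n(base, n):
    """Distribution of the highest-reward draw among n i.i.d. base samples."""
    cdf = np.cumsum(base)
    cdf_prev = np.concatenate(([0.0], cdf[:-1]))
    p = cdf**n - cdf_prev**n
    return p / p.sum()  # guard against floating-point drift in the cumsum

def tilted(beta):
    """KL-regularized optimum: pi*(y) proportional to base(y) * exp(reward(y) / beta)."""
    w = base * np.exp((reward - reward.max()) / beta)
    return w / w.sum()

def match_beta(target_kl, lo=0.05, hi=50.0, iters=100):
    """Bisect (on a log scale) for the beta whose tilted policy has KL == target_kl.

    KL(tilted(beta) || base) decreases as beta grows, so a too-large KL means
    beta must increase.
    """
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if kl(tilted(mid), base) > target_kl:
            lo = mid
        else:
            hi = mid
    return np.sqrt(lo * hi)

n = 4
p_bon = best_of_n(base, n)
kl_bon = kl(p_bon, base)
gain_bon = float(np.dot(p_bon - base, reward))

beta_match = match_beta(kl_bon)
gain_opt = float(np.dot(tilted(beta_match) - base, reward))

print(f"KL budget used by best-of-{n}: {kl_bon:.4f}")
print(f"best-of-{n} reward gain:       {gain_bon:.4f}")
print(f"optimal gain at the same KL:  {gain_opt:.4f}")
```

Sweeping `n` and plotting `(kl_bon, gain_bon)` against the tilted-policy curve gives a toy version of a KL-reward frontier; how close the two curves sit depends on the assumed base policy and rewards.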

📝 Abstract

Language model (LM) alignment improves model outputs to reflect human preferences while preserving the capabilities of the base model. The most common alignment approaches are (i) reinforcement learning, which maximizes the expected reward under a KL-divergence constraint, and (ii) best-of-$N$ alignment, which selects the highest-reward output among $N$ independent samples. Despite their widespread use, the fundamental limits of reward improvement under a KL budget remain poorly understood. We characterize the information-theoretic limits of KL-regularized alignment by deriving the maximum achievable expected reward gain for a fixed KL-divergence budget. Our first result provides a closed-form expression for the optimal reward improvement, governed by a Jeffreys divergence term rather than the $\sqrt{\texttt{KL}}$ used in prior analyses. We further reformulate this expression as a covariance under the base model, yielding a practical estimator that predicts achievable alignment gains from base model samples alone. We extend our analysis to the proxy reward setting, showing that the gap between ideal and proxy alignment (reward hacking) grows as the magnitude of the reward error increases and as the KL penalty factor decreases. We then prove that reward ensembling mitigates reward hacking, providing a theoretical justification for this technique used in practice. Empirically, we compute the KL-reward Pareto frontier for two LM tasks, safety and summarization, and show that best-of-$N$ closely approaches the theoretical limit, while PPO and GRPO remain substantially suboptimal. Our theoretical results shed light on several empirically observed phenomena in the alignment literature and suggest that algorithmic improvements are needed to achieve optimal alignment without high inference costs.
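The Jeffreys-divergence and covariance statements can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's estimator: it assumes the standard KL-regularized optimum π*(y) ∝ π₀(y)·exp(r(y)/β), for which the reward gain E_π*[r] - E_π₀[r] equals β·(KL(π*‖π₀) + KL(π₀‖π*)) and also equals Cov_π₀(r, w) with w = exp(r/β) / E_π₀[exp(r/β)]. The four-outcome policy, the rewards, and all function names are hypothetical.

```python
import numpy as np

def tilted_policy(base_probs, rewards, beta):
    """Assumed optimum of E[r] - beta * KL: pi*(y) proportional to pi0(y) * exp(r(y) / beta)."""
    logits = np.log(base_probs) + rewards / beta
    logits -= logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def gain_from_base_samples(sample_rewards, beta):
    """Estimate the reward gain as Cov(r, w) using rewards of base-model samples only.

    w is the self-normalized weight exp(r / beta) / mean(exp(r / beta)); because
    mean(w) == 1, the covariance reduces to mean(r * w) - mean(r).
    """
    w = np.exp((sample_rewards - sample_rewards.max()) / beta)
    w /= w.mean()
    return float(np.mean(sample_rewards * w) - np.mean(sample_rewards))

# Hypothetical toy problem: 4 possible outputs.
base_probs = np.array([0.4, 0.3, 0.2, 0.1])
rewards = np.array([0.0, 0.5, 1.0, 2.0])
beta = 1.0

pi_star = tilted_policy(base_probs, rewards, beta)
gain = float(np.dot(pi_star - base_probs, rewards))
jeffreys = kl(pi_star, base_probs) + kl(base_probs, pi_star)

# Draw samples from the base policy only and estimate the same gain.
rng = np.random.default_rng(0)
idx = rng.choice(len(base_probs), size=200_000, p=base_probs)
estimate = gain_from_base_samples(rewards[idx], beta)

print(f"exact reward gain:        {gain:.4f}")
print(f"beta * Jeffreys:          {beta * jeffreys:.4f}")
print(f"covariance estimate:      {estimate:.4f}")
```

The first two printed values agree up to floating-point error; the third is a Monte Carlo estimate whose variance grows as β shrinks, since the weights `w` then concentrate on a few high-reward samples.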

Problem

Research questions and friction points this paper is trying to address.

language model alignment

KL-divergence budget

reward improvement

theoretical limits

reward hacking

Innovation

Methods, ideas, or system contributions that make the work stand out.

KL-regularized alignment

Jeffreys divergence

reward hacking