🤖 AI Summary
This work addresses the learning efficiency of preference modeling under KL regularization in reinforcement learning from human feedback (RLHF). We identify theoretical and practical limitations of existing approaches, particularly those based on the Bradley–Terry model and on optimistic or pessimistic estimation. By analyzing a more general preference model, we show that the optimal policy class under KL regularization has a distinctive structural property: the optimal policy distribution is uniquely determined by the marginal empirical frequencies of pairwise comparisons. Leveraging this insight, we propose a greedy sampling algorithm that relies solely on empirical estimates, requiring neither confidence intervals nor parametric model assumptions. We prove that this method achieves optimal sample complexity under both the general preference model and the Bradley–Terry special case, improving order-wise on prior methods. The work establishes, for the first time, the theoretical sufficiency of greedy sampling in RLHF, providing a rigorous foundation for efficient and practically simple algorithms.
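As a loose illustration of the "greedy sampling from empirical estimates" idea (this is a minimal sketch, not the paper's actual algorithm), the snippet below simulates Bradley–Terry preference feedback over a small finite set of responses, plugs the raw empirical win rates in as a reward estimate, and forms the standard closed-form KL-regularized policy `pi(y) ∝ pi_ref(y) * exp(r_hat(y) / beta)`. All names, constants, and the choice of averaged win rate as the plug-in estimate are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical setup (illustration only) ---
n = 4                         # finite set of candidate responses
beta = 0.5                    # KL-regularization strength
pi_ref = np.full(n, 1.0 / n)  # uniform reference policy
r_true = np.array([0.0, 0.5, 1.0, 1.5])  # latent BT rewards, unknown to the learner

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Collect preference feedback: for each ordered pair, record the
# empirical frequency with which response i beats response j.
m = 2000  # comparisons per pair
p_hat = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i == j:
            p_hat[i, j] = 0.5
            continue
        p = sigmoid(r_true[i] - r_true[j])    # BT preference probability
        p_hat[i, j] = rng.binomial(m, p) / m  # empirical estimate (no confidence bonus)

# Greedy plug-in "reward": each response's average empirical win rate.
r_hat = p_hat.mean(axis=1)

# Closed-form KL-regularized policy: pi(y) ∝ pi_ref(y) * exp(r_hat(y) / beta).
logits = np.log(pi_ref) + r_hat / beta
pi = np.exp(logits - logits.max())  # subtract max for numerical stability
pi /= pi.sum()

print(np.round(pi, 3))
```

Note that the estimates are used directly, with no optimistic bonus or pessimistic penalty: the resulting policy still concentrates mass on the higher-reward responses while the KL term keeps it anchored to the reference policy.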
📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique for post-training large language models. Despite its empirical success, the theoretical understanding of RLHF is still limited, as learning the KL-regularized target with only preference feedback poses additional challenges compared with canonical RL. Existing works mostly study the reward-based Bradley–Terry (BT) preference model and extend classical designs that utilize optimism or pessimism. This work, instead, considers the general preference model (whose practical relevance has been observed recently) and obtains performance guarantees with major, order-wise improvements over existing ones. Surprisingly, these results are derived from algorithms that directly use the empirical estimates (i.e., greedy sampling), as opposed to constructing optimistic or pessimistic estimates as in previous works. This insight has a deep root in the unique structural property of the optimal policy class under the KL-regularized target, and we further specialize it to the BT model, highlighting the surprising sufficiency of greedy sampling in RLHF.
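For context, the "KL-regularized target" referenced above typically takes the following standard form in the RLHF literature (notation here is generic, not taken from this paper; pi_ref is the reference policy and beta > 0 the regularization strength):

```latex
\max_{\pi} \;\; \mathbb{E}_{y \sim \pi}\!\left[ r(y) \right]
  \;-\; \beta \,\mathrm{KL}\!\left( \pi \,\|\, \pi_{\mathrm{ref}} \right),
\qquad
\pi^{*}(y) \;\propto\; \pi_{\mathrm{ref}}(y)\,
  \exp\!\left( \tfrac{r(y)}{\beta} \right).
```

The closed form on the right is the well-known Gibbs/softmax policy class; the structural property of the optimal policies discussed above concerns this regularized setting.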