🤖 AI Summary
This paper addresses reinforcement learning from human feedback (RLHF) in general stochastic Markov decision processes (MDPs), under general preference models beyond Bradley–Terry.
Method: We propose the first reward-model-free, provably convergent algorithmic framework for RLHF. Instead of fitting a reward model, our approach estimates local value function differences directly from human preferences and uses them to construct zeroth-order approximations of the policy gradient.
Contribution/Results: Theoretically, we establish the first polynomial convergence rate guarantees for RLHF in general MDPs, overcoming a key limitation of DPO, which applies only to bandits or deterministic MDPs. The rates are polynomial in the number of policy gradient iterations, the number of trajectory samples, and the number of human preference queries per iteration. Empirically, our method outperforms DPO and PPO in stochastic environments.
📝 Abstract
Reward inference (learning a reward model from human preferences) is a critical intermediate step in the Reinforcement Learning from Human Feedback (RLHF) pipeline for fine-tuning Large Language Models (LLMs). In practice, RLHF faces fundamental challenges such as distribution shift, reward model overfitting, and problem misspecification. An alternative approach is direct policy optimization without reward inference, such as Direct Preference Optimization (DPO), which provides a much simpler pipeline and has shown empirical success in LLM applications. However, DPO relies on the closed-form relationship between the optimal policy and the reward function, which holds only in the bandit setting or in deterministic MDPs. This paper develops two RLHF algorithms without reward inference for general RL problems beyond bandits and deterministic MDPs, and for general preference models beyond the Bradley-Terry model. The key idea is to estimate the local value function difference from human preferences and then approximate the policy gradient with a zeroth-order gradient approximator. For both algorithms, we establish polynomial convergence rates in terms of the number of policy gradient iterations, the number of trajectory samples, and the number of human preference queries per iteration. Numerical experiments in stochastic environments validate the performance of our proposed algorithms, which outperform popular RLHF baselines such as DPO and PPO. Our paper shows that there exist provably efficient methods to solve general RLHF problems without reward inference.
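The key idea above, estimating a local value difference from preference feedback and feeding it into a zeroth-order gradient approximator, can be illustrated with a toy sketch. Everything here is an assumption for illustration, not the paper's actual algorithm: the quadratic "policy value" stands in for the true value function, a Bradley-Terry Bernoulli simulator stands in for the human annotator, and the parameters `mu`, `n_dirs`, and `n_queries` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def preference_prob(delta):
    # Bradley-Terry link: probability the first trajectory is preferred,
    # as a sigmoid of the value difference.
    return 1.0 / (1.0 + np.exp(-delta))

def estimate_value_diff(theta_plus, theta_minus, value_fn, n_queries=200):
    # Simulate n_queries preference comparisons between rollouts of the two
    # perturbed policies, then invert the Bradley-Terry link (logit of the
    # empirical win rate) to recover an estimate of the value difference.
    delta_true = value_fn(theta_plus) - value_fn(theta_minus)
    wins = rng.binomial(n_queries, preference_prob(delta_true))
    p_hat = np.clip(wins / n_queries, 1e-3, 1.0 - 1e-3)
    return np.log(p_hat / (1.0 - p_hat))

def zeroth_order_gradient(theta, value_fn, mu=0.1, n_dirs=64):
    # Two-point Gaussian-smoothing estimator: g ~ E[(dV / (2*mu)) * u],
    # built only from preference-based value-difference estimates,
    # with no reward model anywhere.
    g = np.zeros_like(theta)
    for _ in range(n_dirs):
        u = rng.standard_normal(theta.shape[0])
        dv = estimate_value_diff(theta + mu * u, theta - mu * u, value_fn)
        g += (dv / (2.0 * mu)) * u
    return g / n_dirs

# Hypothetical stand-in for the policy value: smooth, maximized at theta = 1.
value = lambda th: -np.sum((th - 1.0) ** 2)

theta = np.zeros(3)
for _ in range(200):
    theta = theta + 0.05 * zeroth_order_gradient(theta, value)
```

Under these assumptions the parameters climb toward the maximizer using only binary preference outcomes, which is the mechanism the abstract describes; the paper's algorithms additionally handle trajectory sampling in stochastic MDPs and come with the stated convergence guarantees.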