🤖 AI Summary
This paper addresses reinforcement learning from human feedback (RLHF) in general stochastic Markov decision processes (MDPs), under general preference models beyond Bradley–Terry.
Method: We propose the first reward-model-free, provably convergent algorithmic framework for RLHF. Instead of fitting a reward model, our approach estimates local value function differences directly from human preferences and uses them to construct zeroth-order approximations of the policy gradient.
Contribution/Results: Theoretically, we establish the first polynomial convergence rate guarantees for RLHF in general MDPs, overcoming a key limitation of DPO, which applies only to bandits or deterministic MDPs. The rates are polynomial in the number of policy gradient iterations, the number of trajectory samples, and the number of human preference queries per iteration. Empirically, our method outperforms DPO and PPO in stochastic environments.
📝 Abstract
Reward inference (learning a reward model from human preferences) is a critical intermediate step in the Reinforcement Learning from Human Feedback (RLHF) pipeline for fine-tuning Large Language Models (LLMs). In practice, RLHF faces fundamental challenges such as distribution shift, reward model overfitting, and problem misspecification. An alternative approach is direct policy optimization without reward inference, such as Direct Preference Optimization (DPO), which provides a much simpler pipeline and has shown empirical success in LLM applications. However, DPO relies on the closed-form relationship between the optimal policy and the reward function, which holds only in the bandit setting or in deterministic MDPs. This paper develops two RLHF algorithms without reward inference for general RL problems beyond bandits and deterministic MDPs, and for general preference models beyond the Bradley-Terry model. The key idea is to estimate the local value function difference from human preferences and then approximate the policy gradient with a zeroth-order gradient approximator. For both algorithms, we establish polynomial convergence rates in terms of the number of policy gradient iterations, the number of trajectory samples, and the number of human preference queries per iteration. Numerical experiments in stochastic environments validate the performance of our proposed algorithms, which outperform popular RLHF baselines such as DPO and PPO. Our paper shows that there exist provably efficient methods to solve general RLHF problems without reward inference.
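The key idea above, estimating a local value difference from preference feedback and feeding it into a zeroth-order gradient approximator, can be illustrated with a toy sketch. Everything here is an assumption for illustration, not the paper's actual algorithm: the quadratic "policy value" stands in for the true value function, a Bradley-Terry Bernoulli simulator stands in for the human annotator, and the parameters `mu`, `n_dirs`, and `n_queries` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def preference_prob(delta):
    # Bradley-Terry link: probability the first trajectory is preferred,
    # as a sigmoid of the value difference.
    return 1.0 / (1.0 + np.exp(-delta))

def estimate_value_diff(theta_plus, theta_minus, value_fn, n_queries=200):
    # Simulate n_queries preference comparisons between rollouts of the two
    # perturbed policies, then invert the Bradley-Terry link (logit of the
    # empirical win rate) to recover an estimate of the value difference.
    delta_true = value_fn(theta_plus) - value_fn(theta_minus)
    wins = rng.binomial(n_queries, preference_prob(delta_true))
    p_hat = np.clip(wins / n_queries, 1e-3, 1.0 - 1e-3)
    return np.log(p_hat / (1.0 - p_hat))

def zeroth_order_gradient(theta, value_fn, mu=0.1, n_dirs=64):
    # Two-point Gaussian-smoothing estimator: g ~ E[(dV / (2*mu)) * u],
    # built only from preference-based value-difference estimates,
    # with no reward model anywhere.
    g = np.zeros_like(theta)
    for _ in range(n_dirs):
        u = rng.standard_normal(theta.shape[0])
        dv = estimate_value_diff(theta + mu * u, theta - mu * u, value_fn)
        g += (dv / (2.0 * mu)) * u
    return g / n_dirs

# Hypothetical stand-in for the policy value: smooth, maximized at theta = 1.
value = lambda th: -np.sum((th - 1.0) ** 2)

theta = np.zeros(3)
for _ in range(200):
    theta = theta + 0.05 * zeroth_order_gradient(theta, value)
```

Under these assumptions the parameters climb toward the maximizer using only binary preference outcomes, which is the mechanism the abstract describes; the paper's algorithms additionally handle trajectory sampling in stochastic MDPs and come with the stated convergence guarantees.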