Provable Reinforcement Learning from Human Feedback with an Unknown Link Function

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses model misspecification in reinforcement learning from human feedback (RLHF) arising from unknown link functions. We propose ZSPO, the first theoretically guaranteed convergent model-free algorithm for RLHF without assuming a known link function (e.g., Bradley–Terry). ZSPO replaces reliance on parametric link models with zeroth-order preference sign estimation to approximate a policy gradient direction positively correlated with the true gradient. Our work establishes the first RLHF theoretical framework with provable polynomial convergence rates under arbitrary unknown link functions, introducing the “sign-driven policy optimization” paradigm. Experiments demonstrate that ZSPO significantly outperforms state-of-the-art methods—including DPO and PPO—under link function mismatch, achieving both theoretical rigor and empirical robustness.

📝 Abstract
Link functions, which characterize how human preferences are generated from the value function of an RL problem, are a crucial component in designing RLHF algorithms. Almost all RLHF algorithms, including state-of-the-art ones in empirical studies such as DPO and PPO, assume the link function is known to the agent (e.g., a logistic function according to the Bradley–Terry model), which is arguably unrealistic given the complex nature of human preferences. To avoid link function mis-specification, this paper studies general RLHF problems with unknown link functions. We propose a novel policy optimization algorithm called ZSPO based on a new zeroth-order policy optimization method, where the key is to use human preferences to construct a parameter update direction that is positively correlated with the true policy gradient direction. ZSPO achieves this by estimating the sign of the value function difference rather than estimating the gradient from the value function difference, so it does not require knowing the link function. Under mild conditions, ZSPO converges to a stationary policy with a polynomial convergence rate depending on the number of policy iterations and trajectories per iteration. Numerical results also show the superiority of ZSPO under link function mismatch.
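The sign-driven update described above can be illustrated with a minimal sketch. Note this is an illustrative toy, not the paper's actual ZSPO algorithm: the quadratic objective, the noiseless `preference_oracle`, the perturbation size `delta`, and the step size `eta` are all assumptions made for demonstration. The key idea it shows is that querying only the *sign* of a value comparison — never the value difference itself — yields an update direction positively correlated with the true gradient, regardless of the link function.

```python
import numpy as np

rng = np.random.default_rng(0)

def value(theta):
    # Toy stand-in for the policy value V(pi_theta): a concave
    # quadratic maximized at theta = [1, 1, 1].
    return -np.sum((theta - 1.0) ** 2)

def preference_oracle(theta_a, theta_b):
    # Stand-in for a human comparison: +1 if the first policy is
    # preferred, -1 otherwise. Only the sign of the value difference
    # is observed, so no link function is ever modeled.
    return 1.0 if value(theta_a) >= value(theta_b) else -1.0

def sign_driven_update(theta, delta=0.05, eta=0.05):
    # Perturb the parameters along a random unit direction u, ask which
    # perturbed policy is preferred, and step along u with that sign.
    # When preferences are monotone in the value difference, s * u has
    # a positive expected inner product with the true gradient.
    u = rng.standard_normal(theta.shape)
    u /= np.linalg.norm(u)
    s = preference_oracle(theta + delta * u, theta - delta * u)
    return theta + eta * s * u

theta = np.zeros(3)
for _ in range(500):
    theta = sign_driven_update(theta)
print(theta)  # ends near the optimum [1, 1, 1]
```

With a fixed step size the iterate oscillates in a small neighborhood of the optimum; the paper's polynomial convergence rate relies on carefully chosen perturbation and step-size schedules rather than the constants used here.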
Problem

Research questions and friction points this paper is trying to address.

RLHF algorithms unrealistically assume a known link function
ZSPO is proposed for RLHF with unknown link functions
ZSPO estimates the sign of the policy gradient without knowing the link function
Innovation

Methods, ideas, or system contributions that make the work stand out.

ZSPO algorithm for unknown link functions
Zeroth-order policy optimization method
Estimates the sign of the value function difference instead of the gradient