Provable Reinforcement Learning from Human Feedback with an Unknown Link Function

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses model misspecification in reinforcement learning from human feedback (RLHF) arising from unknown link functions. We propose ZSPO, the first theoretically guaranteed convergent model-free algorithm for RLHF without assuming a known link function (e.g., Bradley–Terry). ZSPO replaces reliance on parametric link models with zeroth-order preference sign estimation to approximate a policy gradient direction positively correlated with the true gradient. Our work establishes the first RLHF theoretical framework with provable polynomial convergence rates under arbitrary unknown link functions, introducing the “sign-driven policy optimization” paradigm. Experiments demonstrate that ZSPO significantly outperforms state-of-the-art methods—including DPO and PPO—under link function mismatch, achieving both theoretical rigor and empirical robustness.

📝 Abstract
Link functions, which characterize how human preferences are generated from the value function of an RL problem, are a crucial component in designing RLHF algorithms. Almost all RLHF algorithms, including state-of-the-art ones in empirical studies such as DPO and PPO, assume the link function is known to the agent (e.g., a logistic function according to the Bradley–Terry model), which is arguably unrealistic given the complex nature of human preferences. To avoid link function mis-specification, this paper studies general RLHF problems with unknown link functions. We propose a novel policy optimization algorithm called ZSPO based on a new zeroth-order policy optimization method, where the key is to use human preferences to construct a parameter update direction that is positively correlated with the true policy gradient direction. ZSPO achieves this by estimating the sign of the value function difference rather than estimating the gradient from the value function difference, so it does not require knowing the link function. Under mild conditions, ZSPO converges to a stationary policy with a polynomial convergence rate depending on the number of policy iterations and trajectories per iteration. Numerical results also show the superiority of ZSPO under link function mismatch.
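The sign-driven update described above can be illustrated with a minimal sketch. Note this is an illustrative toy, not the paper's actual ZSPO algorithm: the quadratic objective, the noiseless `preference_oracle`, the perturbation size `delta`, and the step size `eta` are all assumptions made for demonstration. The key idea it shows is that querying only the *sign* of a value comparison — never the value difference itself — yields an update direction positively correlated with the true gradient, regardless of the link function.

```python
import numpy as np

rng = np.random.default_rng(0)

def value(theta):
    # Toy stand-in for the policy value V(pi_theta): a concave
    # quadratic maximized at theta = [1, 1, 1].
    return -np.sum((theta - 1.0) ** 2)

def preference_oracle(theta_a, theta_b):
    # Stand-in for a human comparison: +1 if the first policy is
    # preferred, -1 otherwise. Only the sign of the value difference
    # is observed, so no link function is ever modeled.
    return 1.0 if value(theta_a) >= value(theta_b) else -1.0

def sign_driven_update(theta, delta=0.05, eta=0.05):
    # Perturb the parameters along a random unit direction u, ask which
    # perturbed policy is preferred, and step along u with that sign.
    # When preferences are monotone in the value difference, s * u has
    # a positive expected inner product with the true gradient.
    u = rng.standard_normal(theta.shape)
    u /= np.linalg.norm(u)
    s = preference_oracle(theta + delta * u, theta - delta * u)
    return theta + eta * s * u

theta = np.zeros(3)
for _ in range(500):
    theta = sign_driven_update(theta)
print(theta)  # ends near the optimum [1, 1, 1]
```

With a fixed step size the iterate oscillates in a small neighborhood of the optimum; the paper's polynomial convergence rate relies on carefully chosen perturbation and step-size schedules rather than the constants used here.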
Problem

Research questions and friction points this paper is trying to address.

RLHF algorithms unrealistically assume a known link function
ZSPO is proposed for RLHF with unknown link functions
ZSPO estimates the sign of the policy gradient without knowing the link function
Innovation

Methods, ideas, or system contributions that make the work stand out.

ZSPO algorithm for unknown link functions
Zeroth-order policy optimization method
Estimates the sign of the value function difference instead of the gradient