Strategyproof Reinforcement Learning from Human Feedback

📅 2025-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses strategic misalignment in multi-agent RLHF, where agents' strategic feedback distorts policy learning. Existing methods are not strategyproof: a single strategically misreporting agent among *k* can severely compromise the learned policy. The authors prove a fundamental impossibility result, showing that any strategyproof algorithm must incur a *k*-fold suboptimality gap relative to the optimal policy, which reveals an inherent trade-off between incentive alignment and policy optimality. To address this limitation, they propose the first approximately strategyproof pessimistic median algorithm. It constructs pessimistic reward estimates under a coverage assumption, aggregates them across agents by the median within a multi-agent game-theoretic model, and converges to the optimal policy as both the number of agents and the sample size grow, thereby substantially mitigating strategic policy shift.
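The full algorithm is specified in the paper; as a rough illustrative sketch only (the array shapes, the `beta` confidence width, and the count-based pessimism penalty are all assumptions, not the paper's construction), combining pessimism with median aggregation might look like:

```python
import numpy as np

def pessimistic_median_rewards(reward_estimates, visit_counts, beta=1.0):
    """Hypothetical sketch of pessimistic median aggregation.

    reward_estimates: (k, n_states) per-agent mean reward estimates
    visit_counts:     (k, n_states) per-agent coverage counts
    beta:             assumed confidence-width hyperparameter
    """
    # Pessimism: subtract a count-based confidence width, so reward
    # estimates for poorly covered states are shrunk downward.
    widths = beta / np.sqrt(np.maximum(visit_counts, 1))
    pessimistic = reward_estimates - widths
    # Median across agents: one strategic agent cannot push the
    # aggregate beyond the span of the other agents' reports.
    return np.median(pessimistic, axis=0)
```

The median is the aggregation step that buys (approximate) strategyproofness; the pessimism step handles limited coverage, mirroring the two ingredients named in the summary.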

📝 Abstract
We study Reinforcement Learning from Human Feedback (RLHF), where multiple individuals with diverse preferences provide feedback strategically to sway the final policy in their favor. We show that existing RLHF methods are not strategyproof, which can result in learning a substantially misaligned policy even when only one out of $k$ individuals reports their preferences strategically. In turn, we also find that any strategyproof RLHF algorithm must perform $k$-times worse than the optimal policy, highlighting an inherent trade-off between incentive alignment and policy alignment. We then propose a pessimistic median algorithm that, under appropriate coverage assumptions, is approximately strategyproof and converges to the optimal policy as the number of individuals and samples increases.
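The non-strategyproofness of naive aggregation can be illustrated with a toy numerical example (a sketch of the general phenomenon, not the paper's construction): averaging reward reports lets a single strategic reporter shift the aggregate arbitrarily, while the median stays within the honest reports' range.

```python
import statistics

honest = [0.50, 0.52, 0.48, 0.51]  # hypothetical honest reward reports
strategic = 100.0                  # one agent exaggerates to sway the policy
reports = honest + [strategic]

mean_est = sum(reports) / len(reports)   # dragged far from the honest values
median_est = statistics.median(reports)  # bounded by the honest reports

print(mean_est)    # 20.402: the single strategic agent dominates
print(median_est)  # 0.51: unchanged from an honest-majority value
```

This is why a mean-based reward model is vulnerable even when only 1 of $k$ individuals misreports, and why the paper's algorithm builds on the median instead.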
Problem

Research questions and friction points this paper is trying to address.

Existing RLHF methods are not strategyproof: a single strategic reporter among $k$ individuals can induce a substantially misaligned policy.
Any strategyproof RLHF algorithm must perform $k$-times worse than the optimal policy, an inherent trade-off between incentive alignment and policy alignment.
Can an algorithm be approximately strategyproof while still converging to the optimal policy as data grows?
Innovation

Methods, ideas, or system contributions that make the work stand out.

First approximately strategyproof RLHF algorithm, complemented by a matching $k$-fold impossibility result
Pessimistic median aggregation of per-agent reward estimates under a coverage assumption
Convergence to the optimal policy as the number of individuals and samples increases