Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback

📅 2026-04-30

🤖 AI Summary

This work addresses reward over-optimization (Goodharting) in reinforcement learning from human feedback (RLHF), where the learned proxy reward diverges from true human utility. To mitigate it, the authors propose Distributionally Robust Regret Optimization (DRRO) over a Wasserstein ambiguity set. Conventional distributionally robust optimization (DRO) optimizes a pessimistic worst-case value; DRRO instead minimizes worst-case regret relative to the best policy under the same reward perturbation, which the authors argue is theoretically less conservative. The promptwise problem is studied through a simplex allocation model: under an ℓ₁ ambiguity set, the inner worst-case regret admits an exact solution and the optimal policy has a water-filling structure. These results yield a policy-gradient algorithm with a sampled-bonus interpretation that requires only minor changes to PPO/GRPO-style training. In the reported experiments, DRRO mitigates over-optimization more effectively than existing baselines, whereas standard DRO is systematically over-pessimistic.

📝 Abstract

Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations research perspective, this creates a decision problem under objective misspecification: the policy is optimized against an estimated reward, while deployment performance is determined by an unobserved objective. The resulting gap leads to reward over-optimization, or Goodharting, where proxy reward continues to improve even after true quality deteriorates. Existing mitigations address this problem through uncertainty penalties, pessimistic rewards, or conservative constraints, but they can be computationally burdensome and overly pessimistic. We propose Wasserstein distributionally robust regret optimization (DRRO) for RLHF. Instead of pessimizing worst-case value as in standard DRO, DRRO pessimizes worst-case regret relative to the best policy under the same plausible reward perturbation. We study the promptwise problem through a simplex allocation model and show that, under an $\ell_1$ ambiguity set, the inner worst-case regret admits an exact solution and the optimal policy has a water-filling structure. These results lead to a practical policy-gradient algorithm with a simple sampled-bonus interpretation and only minor changes to PPO/GRPO-style RLHF training. The framework also clarifies theoretically why DRRO is less pessimistic than DRO, and our experiments show that DRRO mitigates over-optimization more effectively than existing baselines while standard DRO is systematically over-pessimistic.
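To make the difference between the two objectives concrete, here is a minimal toy sketch. It is not the paper's formulation: it assumes a single prompt with K candidate responses, a policy `pi` on the probability simplex, an estimated reward vector `r_hat`, and a plain ℓ₁ ball of radius `eps` around `r_hat` as the ambiguity set. All function and variable names are hypothetical. Under these assumptions both inner problems have closed forms, because the dual norm of ℓ₁ is ℓ∞.

```python
import numpy as np

def worst_case_value(pi, r_hat, eps):
    """DRO-style inner problem (toy assumption):
    min over ||r - r_hat||_1 <= eps of pi . r = pi . r_hat - eps * ||pi||_inf."""
    pi = np.asarray(pi, dtype=float)
    r_hat = np.asarray(r_hat, dtype=float)
    return float(pi @ r_hat - eps * np.max(np.abs(pi)))

def worst_case_regret(pi, r_hat, eps):
    """DRRO-style inner problem (toy assumption):
    max over ||r - r_hat||_1 <= eps of (max_j r_j - pi . r)
    = max_j [ (e_j - pi) . r_hat + eps * ||e_j - pi||_inf ]."""
    pi = np.asarray(pi, dtype=float)
    r_hat = np.asarray(r_hat, dtype=float)
    worst = -np.inf
    for j in range(len(pi)):
        d = -pi.copy()
        d[j] += 1.0  # d = e_j - pi, the comparison against pure response j
        worst = max(worst, float(d @ r_hat + eps * np.max(np.abs(d))))
    return worst

def brute_force_regret(pi, r_hat, eps):
    """Check of the closed form: regret is convex in r, so its maximum over
    the l1 ball is attained at a vertex r_hat +/- eps * e_i."""
    pi = np.asarray(pi, dtype=float)
    r_hat = np.asarray(r_hat, dtype=float)
    worst = -np.inf
    for i in range(len(pi)):
        for sign in (-1.0, 1.0):
            r = r_hat.copy()
            r[i] += sign * eps
            worst = max(worst, float(np.max(r) - pi @ r))
    return worst

r_hat = np.array([1.0, 0.9, 0.0])   # illustrative proxy rewards
eps = 0.5                           # illustrative ambiguity radius
greedy = np.array([1.0, 0.0, 0.0])  # all mass on the top proxy reward
mixed = np.array([0.5, 0.5, 0.0])   # mass split across the two near-ties

for name, pi in [("greedy", greedy), ("mixed", mixed)]:
    print(name, worst_case_value(pi, r_hat, eps), worst_case_regret(pi, r_hat, eps))
```

In this toy instance the greedy policy has worst-case regret of about 0.4 and the mixed policy about 0.3, so hedging across near-tied responses lowers worst-case regret. This only illustrates the two objectives under the stated assumptions; it does not reproduce the paper's water-filling solution or its training algorithm.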

Problem

Research questions and friction points this paper is trying to address.

reward over-optimization

objective misspecification

Goodharting

distributionally robust optimization

reinforcement learning from human feedback

Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributionally Robust Optimization

Regret Minimization

Reinforcement Learning from Human Feedback