🤖 AI Summary
In reinforcement learning, reward modeling suffers from “error-regret mismatch”: low test error of the reward model does not guarantee low regret of the optimized policy, primarily due to distributional shift induced by policy optimization.
Method: We prove that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any fixed expected test error, there exist realistic data distributions under which the optimized policy nonetheless incurs high regret. We construct explicit counterexamples, derive quantitative bounds linking reward estimation error to policy regret, and analyze whether policy regularization, as used in methods such as RLHF, protects against this mismatch.
Contribution/Results: We show that a low expected test error yields only a worst-case regret bound, not a guarantee on the performance of the actually learned policy; moreover, standard policy regularization does not eliminate the mismatch. Our analysis provides a theoretical foundation for assessing reward model reliability and safety alignment in preference-based RL, with implications for trustworthy reward learning and deployment-critical applications.
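A minimal numerical sketch of error-regret mismatch in a two-armed bandit (hypothetical numbers, not from the paper): the reward model is evaluated under a data distribution concentrated where it is accurate, so its expected test error is tiny, yet the policy optimized against it picks the off-distribution arm the model overestimates.

```python
import numpy as np

# True rewards of the two arms; arm 0 is optimal.
true_reward = np.array([1.0, 0.0])

# Learned reward model: exact on arm 0, badly wrong on arm 1.
learned_reward = np.array([1.0, 2.0])

# Data distribution used to evaluate the model puts almost all
# of its mass on arm 0, where the model is exact.
data_dist = np.array([0.99, 0.01])

# Expected test error under the data distribution is tiny...
test_error = np.sum(data_dist * np.abs(learned_reward - true_reward))
print(f"expected test error: {test_error:.3f}")  # 0.020

# ...but the greedy policy optimized against the learned reward
# picks arm 1, which the model overestimates off-distribution.
policy_arm = int(np.argmax(learned_reward))
regret = true_reward.max() - true_reward[policy_arm]
print(f"regret of optimized policy: {regret:.1f}")  # 1.0
```

Shrinking the model's error on arm 1 does not by itself help: as long as the learned value of arm 1 exceeds that of arm 0, the greedy policy stays maximally suboptimal while the expected test error can be driven arbitrarily close to zero.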
📝 Abstract
In reinforcement learning, specifying reward functions that capture the intended task can be very challenging. Reward learning aims to address this issue by learning the reward function. However, a learned reward model may have a low error on the data distribution, and yet subsequently produce a policy with large regret. We say that such a reward model has an error-regret mismatch. The main source of an error-regret mismatch is the distributional shift that commonly occurs during policy optimization. In this paper, we mathematically show that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any fixed expected test error, there exist realistic data distributions that allow for error-regret mismatch to occur. We then show that similar problems persist even when using policy regularization techniques, commonly employed in methods such as RLHF. We hope our results stimulate the theoretical and empirical study of improved methods to learn reward models, and better ways to measure their quality reliably.
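The point about regularization can be illustrated in the same toy setting. A sketch with hypothetical numbers, assuming the standard closed-form solution of the KL-regularized objective used in RLHF-style optimization, where the optimal policy satisfies pi(a) ∝ pi_ref(a) · exp(r̂(a)/β): for a moderate penalty strength, the regularized policy still shifts nearly all of its mass onto the overestimated action.

```python
import numpy as np

def kl_regularized_policy(ref, reward, beta):
    """Closed-form maximizer of E_pi[reward] - beta * KL(pi || ref):
    pi(a) proportional to ref(a) * exp(reward(a) / beta)."""
    logits = np.log(ref) + reward / beta
    w = np.exp(logits - logits.max())  # subtract max for numerical stability
    return w / w.sum()

true_reward = np.array([1.0, 0.0])     # arm 0 is optimal
learned_reward = np.array([1.0, 2.0])  # model overestimates the rare arm 1
ref = np.array([0.99, 0.01])           # reference policy = data distribution
beta = 0.1                             # KL penalty strength

pi = kl_regularized_policy(ref, learned_reward, beta)
regret = true_reward.max() - pi @ true_reward
print(f"pi = {np.round(pi, 3)}, regret = {regret:.3f}")
```

With these numbers the policy places over 99% of its mass on the bad arm despite the KL penalty. Raising β eventually pins the policy back to the reference, but only by forgoing the optimization itself; the mismatch is suppressed, not corrected.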