🤖 AI Summary
In reinforcement learning, reward modeling suffers from “error-regret mismatch”: low test error of the reward model does not guarantee low regret of the optimized policy, primarily due to distributional shift induced by policy optimization.
Method: We prove that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any fixed expected test error, there exist realistic data distributions under which the optimized policy nonetheless incurs high regret. We construct explicit counterexamples, derive quantitative bounds linking reward estimation error to policy regret, and analyze whether policy regularization, as used in methods such as RLHF, protects against this mismatch.
Contribution/Results: We show that a low expected test error yields only a worst-case regret bound, not a guarantee on the performance of the actually learned policy; moreover, standard policy regularization does not eliminate the mismatch. Our analysis provides a theoretical foundation for assessing reward model reliability and safety alignment in preference-based RL, with implications for trustworthy reward learning and deployment-critical applications.
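A minimal numerical sketch of error-regret mismatch in a two-armed bandit (hypothetical numbers, not from the paper): the reward model is evaluated under a data distribution concentrated where it is accurate, so its expected test error is tiny, yet the policy optimized against it picks the off-distribution arm the model overestimates.

```python
import numpy as np

# True rewards of the two arms; arm 0 is optimal.
true_reward = np.array([1.0, 0.0])

# Learned reward model: exact on arm 0, badly wrong on arm 1.
learned_reward = np.array([1.0, 2.0])

# Data distribution used to evaluate the model puts almost all
# of its mass on arm 0, where the model is exact.
data_dist = np.array([0.99, 0.01])

# Expected test error under the data distribution is tiny...
test_error = np.sum(data_dist * np.abs(learned_reward - true_reward))
print(f"expected test error: {test_error:.3f}")  # 0.020

# ...but the greedy policy optimized against the learned reward
# picks arm 1, which the model overestimates off-distribution.
policy_arm = int(np.argmax(learned_reward))
regret = true_reward.max() - true_reward[policy_arm]
print(f"regret of optimized policy: {regret:.1f}")  # 1.0
```

Shrinking the model's error on arm 1 does not by itself help: as long as the learned value of arm 1 exceeds that of arm 0, the greedy policy stays maximally suboptimal while the expected test error can be driven arbitrarily close to zero.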
📝 Abstract
In reinforcement learning, specifying reward functions that capture the intended task can be very challenging. Reward learning aims to address this issue by learning the reward function. However, a learned reward model may have a low error on the data distribution, and yet subsequently produce a policy with large regret. We say that such a reward model has an error-regret mismatch. The main source of an error-regret mismatch is the distributional shift that commonly occurs during policy optimization. In this paper, we mathematically show that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any fixed expected test error, there exist realistic data distributions that allow for error-regret mismatch to occur. We then show that similar problems persist even when using policy regularization techniques, commonly employed in methods such as RLHF. We hope our results stimulate the theoretical and empirical study of improved methods to learn reward models, and better ways to measure their quality reliably.
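The point about regularization can be illustrated in the same toy setting. A sketch with hypothetical numbers, assuming the standard closed-form solution of the KL-regularized objective used in RLHF-style optimization, where the optimal policy satisfies pi(a) ∝ pi_ref(a) · exp(r̂(a)/β): for a moderate penalty strength, the regularized policy still shifts nearly all of its mass onto the overestimated action.

```python
import numpy as np

def kl_regularized_policy(ref, reward, beta):
    """Closed-form maximizer of E_pi[reward] - beta * KL(pi || ref):
    pi(a) proportional to ref(a) * exp(reward(a) / beta)."""
    logits = np.log(ref) + reward / beta
    w = np.exp(logits - logits.max())  # subtract max for numerical stability
    return w / w.sum()

true_reward = np.array([1.0, 0.0])     # arm 0 is optimal
learned_reward = np.array([1.0, 2.0])  # model overestimates the rare arm 1
ref = np.array([0.99, 0.01])           # reference policy = data distribution
beta = 0.1                             # KL penalty strength

pi = kl_regularized_policy(ref, learned_reward, beta)
regret = true_reward.max() - pi @ true_reward
print(f"pi = {np.round(pi, 3)}, regret = {regret:.3f}")
```

With these numbers the policy places over 99% of its mass on the bad arm despite the KL penalty. Raising β eventually pins the policy back to the reference, but only by forgoing the optimization itself; the mismatch is suppressed, not corrected.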