From Curiosity to Caution: Mitigating Reward Hacking for Best-of-N with Pessimism

📅 2026-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the reward hacking problem in Best-of-N sampling, where imperfect reward models lead language models to exploit reward-model vulnerabilities rather than generate genuinely high-quality responses as the candidate count \( N \) increases, ultimately degrading performance. To mitigate this, the authors propose a "caution" mechanism that brings the pessimism principle from reinforcement learning into the inference phase of large language models. By estimating prediction errors to quantify uncertainty on out-of-distribution responses, the method automatically identifies and penalizes anomalous outputs that likely result from reward hacking. By combining uncertainty estimation, reward correction, and out-of-distribution penalties, the approach substantially alleviates reward hacking across multiple tasks, stabilizes Best-of-N performance as \( N \) grows, and comes with theoretical guarantees in linear settings.
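The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `reward_model` and `error_model` are hypothetical callables standing in for the learned reward model and the error model trained on typical responses, and the toy scoring functions below are invented purely to show how a cautious score can change the selected candidate.

```python
def cautious_best_of_n(candidates, reward_model, error_model, penalty_weight=1.0):
    """Select the candidate maximizing reward minus an uncertainty penalty.

    reward_model(c) -> float: reward estimate for candidate c.
    error_model(c)  -> float: prediction error of an auxiliary model trained
                      on typical responses (high for atypical/OOD outputs).
    """
    def cautious_score(c):
        # Pessimistic score: penalize prediction error as a proxy for
        # distributional uncertainty (the reverse of curiosity bonuses).
        return reward_model(c) - penalty_weight * error_model(c)

    return max(candidates, key=cautious_score)


# Toy stand-ins (hypothetical): the "hacked" reward grows with length,
# while very long answers are atypical and incur a large prediction error.
candidates = ["short", "a medium answer", "a very very very long exploit-y answer"]
reward = lambda c: len(c)
error = lambda c: max(0, len(c) - 20) ** 1.5

plain_bon = max(candidates, key=reward)          # picks the exploit-y answer
cautious = cautious_best_of_n(candidates, reward, error, penalty_weight=1.0)
```

In this toy setup, plain BoN selects the longest (reward-hacked) response, while the cautious score rejects it because its prediction-error penalty outweighs its inflated reward.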
📝 Abstract
Inference-time compute scaling has emerged as a powerful paradigm for improving language model performance on a wide range of tasks, but the question of how best to use the additional compute remains open. A popular approach is Best-of-N (BoN) sampling, where N candidate responses are generated, scored according to a reward model, and the highest-scoring response is selected. While this approach can improve performance, it is vulnerable to reward hacking, where performance degrades as N increases due to the selection of responses that exploit imperfections in the reward model instead of genuinely improving generation quality. Prior attempts to mitigate reward hacking, via stronger reward models or heavy-handed distributional regularization, either fail to fully address over-optimization or are too conservative to exploit additional compute. In this work, we explore the principle of pessimism in reinforcement learning (RL), which uses lower confidence bounds on value estimates to avoid out-of-distribution (OOD) actions with uncertain reward estimates. Our approach, termed caution, can be seen as the reverse of curiosity: where curiosity rewards prediction error as a signal of novelty, caution penalizes prediction error as a signal of distributional uncertainty. Practically, caution trains an error model on typical responses and uses its prediction error to lower reward estimates for atypical ones. Our extensive empirical evaluation demonstrates that caution is a simple, computationally efficient approach that substantially mitigates reward hacking in BoN sampling. We also provide a theoretical analysis in a simplified linear setting, which shows that caution provably improves over the standard BoN approach. Together, our results not only establish caution as a practical solution to reward hacking, but also provide evidence that curiosity-based approaches can serve as a general OOD detection technique in LLM settings.
Problem

Research questions and friction points this paper is trying to address.

reward hacking
Best-of-N
out-of-distribution
language models
inference-time scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

reward hacking
Best-of-N sampling
pessimism
out-of-distribution detection
caution