Inference-Time Reward Hacking in Large Language Models

📅 2025-06-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses reward hacking: the degradation of correctness, safety, and alignment objectives during large language model (LLM) inference when an imperfect proxy reward model is overoptimized. The authors study inference-time alignment under Best-of-$n$ (BoN) and Soft-Best-of-$n$ (SBoN), and introduce Best-of-Poisson (BoP), a computationally efficient sampling strategy that provides a near-exact approximation of the optimal reward–KL divergence policy. They show that the characteristic "rise-then-fall" pattern of the true reward under increasing optimization pressure is an inevitable property of a broad class of inference-time mechanisms, including BoN and BoP, and propose HedgeTune, an efficient algorithm for tuning the inference-time parameter so as to hedge against misleadingly high proxy reward signals. Experiments demonstrate that hedging mitigates reward hacking with minimal computational overhead, achieving superior distortion–reward tradeoffs.

📝 Abstract
A common paradigm to improve the performance of large language models is optimizing for a reward model. Reward models assign a numerical score to LLM outputs indicating, for example, which response would likely be preferred by a user or is most aligned with safety goals. However, reward models are never perfect. They inevitably function as proxies for complex desiderata such as correctness, helpfulness, and safety. By overoptimizing for a misspecified reward, we can subvert intended alignment goals and reduce overall performance -- a phenomenon commonly referred to as reward hacking. In this work, we characterize reward hacking in inference-time alignment and demonstrate when and how we can mitigate it by hedging on the proxy reward. We study this phenomenon under Best-of-$n$ (BoN) and Soft-Best-of-$n$ (SBoN), and we introduce Best-of-Poisson (BoP) that provides an efficient, near-exact approximation of the optimal reward-KL divergence policy at inference time. We show that the characteristic pattern of hacking as observed in practice (where the true reward first increases before declining) is an inevitable property of a broad class of inference-time mechanisms, including BoN and BoP. To counter this effect, hedging offers a tactical choice to avoid placing undue confidence in high but potentially misleading proxy reward signals. We introduce HedgeTune, an efficient algorithm to find the optimal inference-time parameter and avoid reward hacking. We demonstrate through experiments that hedging mitigates reward hacking and achieves superior distortion-reward tradeoffs with minimal computational overhead.
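The two sampling mechanisms named in the abstract can be sketched in a few lines. Best-of-$n$ draws $n$ candidates and keeps the one the proxy reward scores highest; Best-of-Poisson instead randomizes the candidate count. The `n = 1 + Poisson(mu)` parameterization and the helper names below are illustrative assumptions, not the paper's exact formulation:

```python
import math
import random

def poisson_sample(mu, rng=random):
    """Draw one Poisson(mu) sample via Knuth's inversion method (fine for small mu)."""
    threshold = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def best_of_n(sample, proxy_reward, n):
    """Best-of-n (BoN): draw n candidates, keep the one the proxy reward scores highest."""
    return max((sample() for _ in range(n)), key=proxy_reward)

def best_of_poisson(sample, proxy_reward, mu, rng=random):
    """Best-of-Poisson (BoP), sketched: randomize the candidate count.

    The n = 1 + Poisson(mu) choice here is an assumption for illustration;
    see the paper for the exact construction.
    """
    n = 1 + poisson_sample(mu, rng)
    return best_of_n(sample, proxy_reward, n)
```

Varying $n$ (or $\mu$) traces out the reward–KL tradeoff: larger values optimize the proxy harder, which is precisely the regime where the rise-then-fall hacking pattern appears and where hedging toward smaller parameters pays off.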
Problem

Research questions and friction points this paper is trying to address.

Overoptimizing for imperfect reward models subverts alignment goals
Inference-time reward hacking reduces model performance and safety
Whether and when hedging on the proxy reward can mitigate reward hacking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Best-of-Poisson, a near-exact, efficient approximation of the optimal reward–KL divergence policy
Proposes HedgeTune to tune inference-time parameters and avoid reward hacking
Characterizes when and how reward hacking arises in inference-time alignment (BoN, SBoN, BoP)