🤖 AI Summary
This work addresses the problem of inference-time alignment: improving the response quality of a pre-trained language model by spending additional computation at inference time, given access to an imperfect reward model, while avoiding reward hacking and performance degradation. We propose InferenceTimePessimism, an algorithm that implements the principle of pessimism in the face of uncertainty via rejection sampling. It is the first inference-time alignment method with theoretical guarantees of both scaling-monotonicity (performance does not degrade as computation increases) and asymptotic optimality, overcoming the reward-hacking pitfalls inherent in Best-of-N. Through a coverage-based analysis and rigorous proofs, we establish that performance does not degrade as the sample budget N grows. Empirical evaluation across multiple models and tasks demonstrates that InferenceTimePessimism consistently outperforms Best-of-N, achieving a superior trade-off between response quality and compute.
📝 Abstract
Inference-time computation provides an important axis for scaling language model performance, but naively scaling compute through techniques like Best-of-$N$ sampling can cause performance to degrade due to reward hacking. Toward a theoretical understanding of how best to leverage additional computation, we focus on inference-time alignment, which we formalize as the problem of improving a pre-trained policy's responses for a prompt of interest, given access to an imperfect reward model. We analyze the performance of inference-time alignment algorithms in terms of (i) response quality and (ii) compute, and provide new results that highlight the importance of the pre-trained policy's coverage over high-quality responses for performance and compute scaling:

1. We show that Best-of-$N$ alignment with an ideal choice of $N$ can achieve optimal performance under stringent notions of coverage, but provably suffers from reward hacking when $N$ is large, and fails to achieve tight guarantees under more realistic coverage conditions.
2. We introduce $\texttt{InferenceTimePessimism}$, a new algorithm which mitigates reward hacking through deliberate use of inference-time compute, implementing the principle of pessimism in the face of uncertainty via rejection sampling; we prove that its performance is optimal and does not degrade with $N$, meaning it is scaling-monotonic.

We complement our theoretical results with an experimental evaluation that demonstrates the benefits of $\texttt{InferenceTimePessimism}$ across a variety of tasks and models.
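To make the contrast concrete, the following is a minimal sketch (not the paper's exact algorithm) of the two strategies discussed above. `best_of_n` is standard Best-of-$N$: draw $N$ candidates and return the one with the highest score under the reward model. `pessimistic_rejection_sampling` is a hypothetical illustration of pessimism via rejection sampling: it targets a regularized tilted policy $\pi(y) \propto \pi_{\text{ref}}(y)\exp(r(y)/\beta)$, where the regularization strength `beta` acts as a pessimism knob that keeps outputs close to the base policy and limits over-optimization of an imperfect reward. All function names, the acceptance rule, and the `r_max` normalizer are assumptions for illustration, not the authors' implementation.

```python
import math
import random

def best_of_n(sample, reward, n):
    """Best-of-N: draw n responses from the base policy and return the one
    with the highest reward-model score. Simple and effective for well-chosen
    n, but prone to reward hacking as n grows large."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=reward)

def pessimistic_rejection_sampling(sample, reward, beta, n, r_max):
    """Hypothetical sketch of pessimism via rejection sampling (assumption,
    not the paper's exact algorithm): accept a draw y with probability
    exp((reward(y) - r_max) / beta), which targets the tilted policy
    pi(y) ~ pi_ref(y) * exp(reward(y) / beta). Larger beta means stronger
    pessimism: the output stays closer to the base policy."""
    last = None
    for _ in range(n):
        y = sample()
        last = y
        # r_max upper-bounds the reward, so the acceptance probability is <= 1.
        if random.random() < math.exp((reward(y) - r_max) / beta):
            return y
    return last  # fallback if the sampling budget n is exhausted
```

Note the structural difference: Best-of-$N$ always commits to the reward model's argmax over the batch, while the rejection sampler only requires a draw to clear a tempered acceptance bar, so cranking up the budget $n$ does not push the output further into the reward model's error regions.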