Evaluation of Best-of-N Sampling Strategies for Language Model Alignment

📅 2025-02-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the “reward hacking” problem in Best-of-N (BoN) sampling for aligning large language models, which arises from imperfect reward modeling. We propose two novel regularized BoN methods: Stochastic Regularized BoN (SRBoN), with theoretical guarantees under robust optimization, and Length-Regularized BoN. Our approach is the first to frame BoN regularization through a robust optimization lens, providing worst-case theoretical guarantees. Experiments on AlpacaFarm and HH-RLHF benchmarks demonstrate that Length-Regularized BoN significantly improves alignment with true human preferences—outperforming existing methods—while empirical results validate SRBoN’s core regularization mechanism. By unifying rigorous theoretical analysis with comprehensive empirical evaluation, this work advances the intersection of reward modeling, decoding-time alignment, and regularization design, delivering substantive progress in both methodology and practical performance.

📝 Abstract
Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) with human preferences at decoding time. However, BoN sampling is susceptible to a problem known as reward hacking: because the reward model is an imperfect proxy for the true objective, excessively optimizing its value can degrade performance on the true objective. Previous work proposes Regularized BoN sampling (RBoN), which adds a regularization term to the BoN objective, and shows empirically that it outperforms BoN sampling by mitigating reward hacking (Jinnai et al., 2024). However, Jinnai et al. (2024) introduce RBoN as a heuristic and do not analyze why such a regularization strategy improves the performance of BoN sampling. The aim of this study is to analyze the effect of regularization strategies on BoN sampling. We show that using these regularization strategies corresponds to robust optimization, which maximizes the worst case over a set of possible perturbations of the proxy reward. Although the resulting theoretical guarantees are not directly applicable to RBoN, RBoN can be viewed as a practical implementation of this idea. This paper proposes an extension of the RBoN framework, called Stochastic RBoN sampling (SRBoN), which is a theoretically guaranteed approach to the worst case of RBoN in the proxy reward. We then perform an empirical evaluation using the AlpacaFarm and Anthropic's hh-rlhf datasets to determine which factors of the regularization strategies contribute to improving the true reward. In addition, we propose another simple RBoN method, Sentence Length Regularized BoN, which outperforms the previous methods in our experiments.
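The selection rules described above can be sketched in a few lines. This is a hypothetical toy illustration, not the paper's implementation: `toy_reward` is an invented proxy reward that spuriously favors longer outputs (mimicking reward hacking), and the length penalty with strength `beta` stands in for the Sentence Length Regularized BoN idea.

```python
def best_of_n(candidates, reward_fn):
    """Plain Best-of-N: return the candidate with the highest proxy reward."""
    return max(candidates, key=reward_fn)

def length_regularized_bon(candidates, reward_fn, beta=0.1):
    """Length-regularized BoN (sketch): subtract a length penalty from the
    proxy reward before selecting; beta controls the penalty strength
    (hypothetical parameterization)."""
    return max(candidates, key=lambda y: reward_fn(y) - beta * len(y))

# Toy proxy reward that leaks length into its score, so plain BoN
# "hacks" it by picking the longest response.
def toy_reward(y):
    return 0.5 * len(y) + y.count("helpful")

candidates = [
    "a helpful answer",
    "a much, much longer but less helpful answer",
]
print(best_of_n(candidates, toy_reward))                          # picks the long response
print(length_regularized_bon(candidates, toy_reward, beta=0.6))   # picks the short response
```

With the penalty applied, the short response wins even though the proxy reward scores the long one higher, illustrating how regularization can counteract a length-biased proxy.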
Problem

Research questions and friction points this paper is trying to address.

Analyzes why regularization improves Best-of-N sampling.
Proposes Stochastic RBoN with worst-case guarantees under robust optimization.
Evaluates which regularization factors improve the true reward.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Stochastic RBoN for worst-case guarantees
Implements Sentence Length Regularized BoN
Evaluates regularization strategies on datasets