Best-of-N through the Smoothing Lens: KL Divergence and Regret Analysis

📅 2025-07-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Best-of-N (BoN) inference-time alignment is highly sensitive to the quality of the proxy reward model used for selection, which can lead to reward hacking and over-optimization. Method: The paper studies Soft Best-of-N (SBoN), a smooth variant of BoN that replaces hard argmax selection with temperature-controlled sampling weighted by proxy reward scores. Contribution/Results: It develops a theoretical framework grounded in KL divergence and regret, bounding the divergence between the SBoN policy and the reference policy and showing how this distribution shift scales with the number of samples. The regret analysis shows that smoothing reduces the optimization bias induced by imperfect proxy rewards. Theoretical and empirical findings indicate that SBoN mitigates reward over-optimization, especially when proxy reward quality is low, without retraining the reward model or requiring additional supervision.

📝 Abstract
A simple yet effective method for inference-time alignment of generative models is Best-of-$N$ (BoN), where $N$ outcomes are sampled from a reference policy, evaluated using a proxy reward model, and the highest-scoring one is selected. While prior work argues that BoN is almost optimal in reward vs KL tradeoffs, the effectiveness of BoN depends critically on the quality of the proxy reward model used for selection. For this purpose, we study BoN through a smooth version known as Soft Best-of-N (SBoN) and develop a theoretical framework to address this gap. We analyze the scaling behaviour of BoN by providing bounds on the KL divergence between the SBoN policy and the reference policy, offering insights into how performance varies with the number of samples. We also study the regret gap, i.e., the gap between the expected true reward under the optimal policy and the SBoN policy. Our theoretical and empirical findings show that smoothing helps SBoN mitigate reward overoptimization, especially when the quality of the proxy reward is low.
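The selection mechanism described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: function names, the `beta` parameterization (an inverse temperature, where `beta → ∞` recovers hard Best-of-N and `beta = 0` recovers uniform sampling from the reference policy's draws), and the toy reward are all assumptions for the example.

```python
import math
import random

def soft_best_of_n(candidates, reward_fn, beta=1.0, rng=None):
    """Soft Best-of-N (illustrative sketch).

    Instead of returning the argmax-reward candidate (hard BoN),
    sample one candidate with probability proportional to
    exp(beta * reward). beta -> infinity recovers hard BoN;
    beta = 0 samples uniformly among the N draws.
    """
    rng = rng or random.Random()
    rewards = [reward_fn(c) for c in candidates]
    m = max(rewards)  # subtract the max before exponentiating, for stability
    weights = [math.exp(beta * (r - m)) for r in rewards]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Draw one candidate from the resulting categorical distribution.
    u = rng.random()
    acc = 0.0
    for candidate, p in zip(candidates, probs):
        acc += p
        if u <= acc:
            return candidate
    return candidates[-1]  # guard against floating-point round-off
```

With a large `beta`, the softmax concentrates on the highest-reward candidate and the behavior matches standard BoN; intermediate values smooth the selection, which is the mechanism the paper's KL and regret analysis quantifies.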
Problem

Research questions and friction points this paper is trying to address.

Analyzes KL divergence in Soft Best-of-N alignment
Studies regret gap between optimal and SBoN policies
Examines reward overoptimization mitigation via smoothing
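The selection rule underlying these questions can be written compactly (notation here is assumed for illustration, not quoted from the paper): given $N$ i.i.d. samples $y_1, \dots, y_N$ from the reference policy $\pi_{\mathrm{ref}}$, SBoN returns sample $y_i$ with probability

$$
P\big(\text{select } y_i \mid y_{1:N}\big) \;=\; \frac{\exp\!\big(r(y_i)/\lambda\big)}{\sum_{j=1}^{N} \exp\!\big(r(y_j)/\lambda\big)},
$$

where $r$ is the proxy reward and $\lambda > 0$ is a smoothing temperature. As $\lambda \to 0$ this recovers hard Best-of-N (argmax selection), and as $\lambda \to \infty$ it reduces to sampling from $\pi_{\mathrm{ref}}$; the KL and regret analysis studies how the induced policy interpolates between these extremes as $N$ and $\lambda$ vary.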
Innovation

Methods, ideas, or system contributions that make the work stand out.

Soft Best-of-N mitigates reward overoptimization
Theoretical framework analyzes KL divergence scaling
Bounds on regret gap under suboptimal rewards