Soft Best-of-n Sampling for Model Alignment

📅 2025-05-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses aligning pre-trained language models with human preferences without fine-tuning. The authors propose Soft Best-of-n (SBoN), a sampling method that smoothly interpolates between the original model distribution and the reward-maximizing distribution via a temperature parameter λ, turning the coarse reward-versus-distortion trade-off of conventional Best-of-n (BoN) into a continuously tunable one. SBoN is a continuous generalization of BoN, backed by a guarantee that it converges to the optimal tilted distribution at rate O(1/n) in both KL divergence and expected relative reward. For sequences of discrete outputs, an analysis of an additive reward model reveals fundamental limitations of blockwise sampling under skewed reward distributions. The reported results show that SBoN improves expected relative reward while constraining distortion of the output distribution, enabling fine-grained, controllable operation along the reward–distortion frontier.

📝 Abstract
Best-of-$n$ (BoN) sampling is a practical approach for aligning language model outputs with human preferences without expensive fine-tuning. BoN sampling is performed by generating $n$ responses to a prompt and then selecting the sample that maximizes a reward function. BoN yields high reward values in practice at a distortion cost, as measured by the KL-divergence between the sampled and original distribution. This distortion is coarsely controlled by varying the number of samples: larger $n$ yields a higher reward at a higher distortion cost. We introduce Soft Best-of-$n$ sampling, a generalization of BoN that allows for smooth interpolation between the original distribution and reward-maximizing distribution through a temperature parameter $\lambda$. We establish theoretical guarantees showing that Soft Best-of-$n$ sampling converges sharply to the optimal tilted distribution at a rate of $O(1/n)$ in KL and the expected (relative) reward. For sequences of discrete outputs, we analyze an additive reward model that reveals the fundamental limitations of blockwise sampling.
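The mechanism the abstract describes can be sketched in a few lines: instead of taking the argmax-reward sample among the $n$ candidates (standard BoN), draw one of them with probability proportional to $\exp(r/\lambda)$. This is a minimal illustrative sketch, not the paper's reference implementation; the function name and signature are hypothetical.

```python
import math
import random

def soft_best_of_n(samples, rewards, lam, rng=random):
    """Soft Best-of-n (sketch): pick one of n candidate samples with
    probability proportional to exp(reward / lam).

    lam -> 0 recovers standard Best-of-n (argmax of the rewards);
    large lam approaches a uniform draw over the n candidates,
    i.e. the original sampling distribution.
    """
    # Subtract the max reward before exponentiating for numerical stability.
    m = max(rewards)
    weights = [math.exp((r - m) / lam) for r in rewards]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Draw an index according to the softmax weights.
    return rng.choices(range(len(samples)), weights=probs, k=1)[0]
```

In practice the `samples` would be $n$ generations from the language model for one prompt and `rewards` the reward-model scores; `lam` then trades off reward gain against distortion of the output distribution.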
Problem

Research questions and friction points this paper is trying to address.

Aligning language model outputs with human preferences without fine-tuning
Controlling distortion cost in Best-of-n sampling via temperature parameter
Analyzing limitations of blockwise sampling for discrete output sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Soft Best-of-n sampling with temperature parameter
Smooth interpolation between original and reward-maximizing distribution
Theoretical guarantees for optimal tilted distribution convergence
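The "optimal tilted distribution" in the last bullet refers to an exponentially tilted version of the base distribution; in generic exponential-tilting notation (symbols here are illustrative, not necessarily the paper's exact notation) it takes the form:

```latex
% Base distribution p reweighted by reward r with temperature \lambda
\pi_{\lambda}(y) \;=\; \frac{p(y)\,\exp\!\big(r(y)/\lambda\big)}{\sum_{y'} p(y')\,\exp\!\big(r(y')/\lambda\big)}
```

As $\lambda \to 0$ this concentrates on reward maximizers, and as $\lambda \to \infty$ it recovers the base distribution $p$, which is the interpolation the bullets describe.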
👥 Authors
C. M. Verdun (Harvard University, Allston, MA, USA)
Alexander X. Oesterling (Harvard University, Allston, MA, USA)
Himabindu Lakkaraju (Assistant Professor, Harvard University; Senior Staff Research Scientist, Google)
  AI Safety and Alignment · Trustworthy AI · Generative AI · Human-AI Collaboration
F. Calmon (Harvard University, Allston, MA, USA)