Best-of-Tails: Bridging Optimism and Pessimism in Inference-Time Alignment

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the trade-off in existing inference-time alignment methods between optimistic strategies (e.g., Best-of-N), which are vulnerable to reward hacking, and pessimistic regularization, which stifles exploration of high-quality responses. Framing the problem as regret minimization, the study establishes the first connection between the tail behavior of reward distributions and the choice of alignment strategy. It proposes a tail-adaptive interpolation mechanism that uses a Hill estimator to assess tail heaviness per prompt and Tsallis divergence as a tunable regularizer, adaptively balancing exploration against robustness. Evaluated on mathematical reasoning, multiple-choice questions, and human-preference assessments, the method consistently outperforms fixed-strategy baselines under diverse reference and reward model configurations.
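The Hill estimator mentioned above is a standard tool for measuring tail heaviness from the top order statistics of a sample. The sketch below is a minimal, self-contained implementation of the classical estimator (not the paper's exact per-prompt procedure); the choice of `k` top statistics is a hyperparameter assumed here.

```python
import math

def hill_estimator(samples, k):
    """Classical Hill estimator of the tail index from the top-k order
    statistics of a sample. A smaller returned alpha-hat indicates a
    heavier right tail. Requires 1 <= k < len(samples) and positive
    values among the top k+1 order statistics."""
    xs = sorted(samples, reverse=True)          # descending order statistics
    if not (1 <= k < len(xs)) or xs[k] <= 0:
        raise ValueError("need 1 <= k < n and positive order statistics")
    # Mean log-ratio of the top-k statistics to the (k+1)-th one.
    h = sum(math.log(xs[i] / xs[k]) for i in range(k)) / k
    return 1.0 / h                              # estimated tail index
```

In a per-prompt setting, `samples` would be the reward-model scores of the candidate generations for that prompt, and the estimate would feed the adaptive selection rule.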

📝 Abstract
Inference-time alignment effectively steers large language models (LLMs) by generating multiple candidates from a reference model and selecting among them with an imperfect reward model. However, current strategies face a fundamental dilemma: "optimistic" approaches like Best-of-$N$ suffer from reward hacking, while "pessimistic" regularized methods often stifle the exploration needed to discover high-quality responses. In this work, we formalize this trade-off through the lens of regret minimization, demonstrating that the optimal strategy depends critically on the tail behavior of the reward distribution. We show theoretically that light-tailed regimes favor optimism to unearth high-quality outliers, whereas heavy-tailed regimes require pessimism to guard against reward mis-calibration in the extremes. Guided by this insight, we introduce Best-of-Tails (BoT), an adaptive inference-time alignment framework that uses Tsallis divergence as a tunable regularizer to provide a finer granularity of interpolation between these extremes. BoT uses the Hill estimator to characterize reward-tail heaviness on a per-prompt basis and dynamically adjusts its selection rule to balance exploration gains against alignment error. Across math, multiple-choice reasoning, and human-preference evaluations, BoT improves alignment performance across a range of reference and reward model configurations relative to fixed-strategy baselines.
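The abstract's idea of interpolating between optimistic argmax selection and a pessimistic regularized rule can be sketched with a Tsallis-style q-exponential weighting, which grows polynomially rather than exponentially in the reward and so down-weights extreme (possibly mis-calibrated) scores. Everything below is illustrative: the `alpha_threshold` cutoff, the `beta` temperature, and the mapping from tail index to the entropic index `q` are hypothetical choices, not the paper's actual selection rule.

```python
import math
import random

def q_exponential(x, q):
    """Tsallis q-exponential exp_q(x); reduces to exp(x) as q -> 1."""
    if abs(q - 1.0) < 1e-9:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0 else 0.0

def select_response(candidates, rewards, alpha_hat,
                    alpha_threshold=2.0, beta=1.0):
    """Tail-adaptive selection sketch: optimistic argmax when the
    estimated reward tail is light (alpha_hat >= alpha_threshold),
    otherwise a pessimistic q-exponential-weighted sample that
    tempers extreme rewards. Threshold and q-mapping are illustrative."""
    if alpha_hat >= alpha_threshold:
        # Light tail: trust the reward model, plain Best-of-N.
        return max(zip(rewards, candidates))[1]
    # Heavier tail -> larger q -> flatter weighting (hypothetical mapping).
    q = 1.0 + 1.0 / alpha_hat
    weights = [q_exponential(r / beta, q) for r in rewards]
    total = sum(weights)
    if total == 0:
        return random.choice(candidates)
    return random.choices(candidates, weights=weights, k=1)[0]
```

Under this sketch, a prompt whose reward scores look light-tailed gets the exploratory Best-of-$N$ winner, while a heavy-tailed prompt falls back to a softened sampling rule that is harder to reward-hack.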
Problem

Research questions and friction points this paper is trying to address.

inference-time alignment
reward hacking
optimism-pessimism trade-off
large language models
reward model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Best-of-Tails
inference-time alignment
reward tail behavior
Tsallis divergence
Hill estimator