What should post-training optimize? A test-time scaling law perspective

📅 2026-05-11
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the mismatch between standard post-training objectives, which optimize the mean reward of a single response, and the best-of-N selection used at deployment. It focuses on the budget-mismatch regime, where training can afford far fewer rollouts per prompt than the deployment budget N. Under structural assumptions on the upper tail of the reward distribution, the authors show that the policy gradient of the best-of-N objective can be approximated from the smaller rollout group by extrapolating upper-tail statistics. This yields a family of Tail-Extrapolated estimators: a simple direct estimator, Tail-Extrapolated Advantage (TEA), and Prefix-TEA, a fixed-order debiased variant based on moment cancellation. Experiments on instruction-following tasks show that both estimators improve best-of-N performance across different language models, reward models, and datasets under various training and test-time budget settings.
📝 Abstract
Large language models are increasingly deployed with test-time strategies: sample $N$ responses, score them with a reward model or verifier, and return the best. This deployment rule exposes a mismatch in post-training: standard objectives optimize the mean reward of a single response, whereas best-of-$N$ performance is governed by the upper tail of the reward distribution. Recent test-time-aware objectives partly address this mismatch, but typically assume that training can use the same per-prompt rollout budget as deployment, which is impractical when post-training must cover many prompts while deployment can allocate much larger per-prompt test-time compute. We study this budget-mismatch regime, where only $m\ll N$ per-prompt rollouts are available during training but the target objective is best-of-$N$ deployment. Under structural assumptions on the reward tails, we show that the policy gradient of the best-of-$N$ objective can be approximated from a much smaller rollout group by extrapolating upper-tail statistics. This yields a family of Tail-Extrapolated estimators for best-of-$N$-oriented post-training: a simple direct estimator, Tail-Extrapolated Advantage (TEA), and a fixed-order debiased Prefix-TEA estimator based on moment cancellation. Experiments on instruction-following tasks show that TEA and Prefix-TEA improve best-of-$N$ performance across different language models, reward models and datasets under various training and test-time budget settings.
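To make the budget mismatch concrete, here is a minimal sketch of the naive plug-in estimate of best-of-$N$ reward from a group of $m$ rollouts. It is not the paper's TEA or Prefix-TEA estimator; the function name `plug_in_best_of_n` and the Gaussian toy rewards are assumptions chosen for illustration. The plug-in estimate treats the $m$ observed rewards as the whole distribution, so it can never exceed the largest observed reward, which is the limitation that tail extrapolation is meant to address when $m\ll N$.

```python
import random


def plug_in_best_of_n(rewards, n):
    """Expected max of n i.i.d. draws from the empirical distribution of `rewards`.

    Hypothetical helper for illustration only; not the paper's estimator.
    """
    sorted_r = sorted(rewards)
    m = len(sorted_r)
    total = 0.0
    for i, r in enumerate(sorted_r, start=1):
        # P(max of n draws is the i-th smallest sample) = (i/m)^n - ((i-1)/m)^n
        total += r * ((i / m) ** n - ((i - 1) / m) ** n)
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    m, n = 8, 64  # assumed training budget m and deployment budget N
    group = [rng.gauss(0.0, 1.0) for _ in range(m)]  # toy rewards for one prompt
    print("mean reward          :", sum(group) / m)
    print("plug-in best-of-N    :", plug_in_best_of_n(group, n))
    print("largest observed     :", max(group))
```

With `n = 1` the function returns the sample mean, which is the quantity standard objectives optimize. As `n` grows past `m`, the estimate saturates at the sample maximum and carries no information about rewards beyond those observed.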
Problem

Research questions and friction points this paper is trying to address.

post-training
best-of-N
reward distribution
test-time scaling
budget mismatch
Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time scaling
best-of-N
tail extrapolation
post-training
policy gradient