Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Adaptive prompt and program search makes LLM evaluation vulnerable to selection bias: when benchmark items are reused during tuning, the winning candidate's observed score can overstate its performance on fresh data. This work proposes SIREN, a selection-aware repeated-split reporting protocol that targets the fresh-data performance of the full tune-then-deploy procedure under an explicit tuning budget. It freezes the post-search shortlist, separates splitwise selection from held-out evaluation, and quantifies uncertainty with an item-level Gaussian multiplier bootstrap. In a fixed-shortlist regime with smooth stabilized selection, the bootstrap gives valid simultaneous inference on a finite budget grid, which supports confidence intervals for procedure-performance curves and for pre-specified equal-budget and cross-budget comparisons. Controlled simulations and MMLU-Pro tuning experiments show that winner-based reporting can be optimistic and can change deployment conclusions, whereas SIREN remains close to the finite-sample reporting target.
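The optimistic bias of winner reporting can be illustrated with a small simulation. The sketch below is not taken from the paper: the function name `simulate`, the candidate count, the item count, and the shared true accuracy are all illustrative assumptions. Every candidate has the same true accuracy, so any gap between the reported winner score and that accuracy comes from selection alone. A single selection/evaluation split stands in for the decoupling idea; it is much simpler than the repeated-split protocol the paper describes.

```python
import random


def simulate(n_candidates=20, n_items=200, true_acc=0.6, n_trials=300, seed=0):
    """Compare winner reporting with split reporting (illustrative assumption:
    all candidates share the same true accuracy `true_acc`)."""
    rng = random.Random(seed)
    half = n_items // 2
    naive_total = 0.0
    split_total = 0.0
    for _ in range(n_trials):
        # outcomes[c][i] is 1 if candidate c answers item i correctly, else 0.
        outcomes = [
            [1 if rng.random() < true_acc else 0 for _ in range(n_items)]
            for _ in range(n_candidates)
        ]
        # Winner reporting: select and score on the same items.
        naive_total += max(sum(row) / n_items for row in outcomes)
        # Split reporting: select on the first half, score on the second half.
        winner = max(range(n_candidates), key=lambda c: sum(outcomes[c][:half]))
        split_total += sum(outcomes[winner][half:]) / (n_items - half)
    return naive_total / n_trials, split_total / n_trials


naive, split = simulate()
print(f"winner reporting: {naive:.3f}")
print(f"split reporting:  {split:.3f}")
```

Under these assumptions the winner-reporting average sits noticeably above 0.6, while the split-reporting average stays close to it, because the held-out items played no role in choosing the winner.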

📝 Abstract

Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level target under explicit tuning budgets. We propose SIREN, a selection-aware repeated-split reporting protocol that freezes the post-search shortlist, separates splitwise selection from held-out evaluation, and uses an item-level Gaussian multiplier bootstrap for uncertainty quantification. In a fixed-shortlist regime with smooth stabilized selection, the estimator admits a first-order item-level representation, and the bootstrap yields valid simultaneous inference on a finite budget grid. This supports confidence intervals for procedure-performance curves and pre-specified equal-budget and cross-budget comparisons. Controlled simulations and MMLU-Pro tuning experiments show that winner-based reporting can be optimistic and can change deployment conclusions, while SIREN remains close to the finite-sample reporting target.
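A Gaussian multiplier bootstrap for simultaneous intervals over a budget grid can be sketched as follows. This is a generic sup-t construction under stated assumptions, not the paper's estimator: it takes as given a hypothetical matrix `psi`, where `psi[b][i]` is the item-level contribution of item `i` to the estimate at budget index `b`. How SIREN derives such contributions from repeated splits is not shown here, and the function name `simultaneous_band` and the synthetic demo data are illustrative only.

```python
import math
import random


def simultaneous_band(psi, alpha=0.05, n_boot=2000, seed=0):
    """Sup-t band from item-level contributions psi[b][i] (hypothetical input).

    Returns (estimates, half_widths, critical_value). Assumes every budget
    row has nonzero sample variance.
    """
    rng = random.Random(seed)
    n_items = len(psi[0])
    est = [sum(row) / n_items for row in psi]
    centered = [[x - m for x in row] for row, m in zip(psi, est)]
    sd = [math.sqrt(sum(x * x for x in row) / (n_items - 1)) for row in centered]
    max_stats = []
    for _ in range(n_boot):
        # One Gaussian multiplier per item, shared across all budgets so the
        # bootstrap preserves the dependence between budget-level estimates.
        xi = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
        stat = max(
            abs(sum(w * x for w, x in zip(xi, row))) / (math.sqrt(n_items) * s)
            for row, s in zip(centered, sd)
        )
        max_stats.append(stat)
    max_stats.sort()
    # Empirical (1 - alpha) quantile of the maximum studentized statistic.
    idx = min(n_boot - 1, math.ceil((1 - alpha) * n_boot) - 1)
    crit = max_stats[idx]
    half = [crit * s / math.sqrt(n_items) for s in sd]
    return est, half, crit


# Synthetic demo data (an assumption): 3 budgets, 300 items, 0/1 contributions.
demo_rng = random.Random(1)
psi = [
    [1.0 if demo_rng.random() < p else 0.0 for _ in range(300)]
    for p in (0.55, 0.60, 0.62)
]
est, half, crit = simultaneous_band(psi, n_boot=1000)
for b, (e, h) in enumerate(zip(est, half)):
    print(f"budget index {b}: {e:.3f} +/- {h:.3f}")
```

Because the critical value is the quantile of a maximum over budgets, it exceeds the pointwise normal value of about 1.96, so the resulting intervals are wider than per-budget intervals but are intended to cover the whole curve at once. A pre-specified cross-budget comparison could be handled the same way by adding a row of item-level differences to `psi`.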

Problem

Research questions and friction points this paper is trying to address.

LLM evaluation

winner's curse

adaptive benchmarking

selection bias

performance estimation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Winner's Curse

Adaptive Benchmarking

Selection-aware Inference