AI Summary
Evaluation of large language models becomes susceptible to selection bias under adaptive prompt and program search: once benchmark items are reused during tuning, the winning candidate's observed score can overestimate its performance on fresh data. This work proposes SIREN, a selection-aware repeated-split reporting protocol that targets the fresh-data performance of the full tune-then-deploy procedure under an explicit tuning budget. It freezes the post-search shortlist, separates the data used for selection from the data used for held-out evaluation, and quantifies uncertainty with an item-level Gaussian multiplier bootstrap. SIREN supports confidence intervals for procedure-performance curves on a finite budget grid and for pre-specified equal-budget and cross-budget comparisons. Controlled simulations and MMLU-Pro tuning experiments show that conventional winner-based reporting can be optimistically biased and can change deployment conclusions, whereas SIREN stays close to the finite-sample reporting target.
Abstract
Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level target under explicit tuning budgets. We propose SIREN, a selection-aware repeated-split reporting protocol that freezes the post-search shortlist, separates splitwise selection from held-out evaluation, and uses an item-level Gaussian multiplier bootstrap for uncertainty quantification. In a fixed-shortlist regime with smooth stabilized selection, the estimator admits a first-order item-level representation, and the bootstrap yields valid simultaneous inference on a finite budget grid. This supports confidence intervals for procedure-performance curves and pre-specified equal-budget and cross-budget comparisons. Controlled simulations and MMLU-Pro tuning experiments show that winner-based reporting can be optimistic and can change deployment conclusions, while SIREN remains close to the finite-sample reporting target.
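The ingredients named in the abstract (a frozen shortlist, splitwise selection kept apart from held-out evaluation, and an item-level Gaussian multiplier bootstrap over a finite budget grid) can be illustrated with a toy sketch. This is not the authors' implementation, and several choices here are assumptions made only for illustration: the simulated 0/1 score matrix, the hard argmax selection (the paper assumes smooth stabilized selection), the mapping "budget B = first B shortlist candidates", the per-item average used as a crude stand-in for the first-order item-level representation, and the unstudentized sup-type band. All function names are hypothetical.

```python
import math
import random


def simulate_scores(true_acc, n_items, rng):
    """Toy 0/1 correctness matrix: scores[k][i] for candidate k on item i."""
    return [[1.0 if rng.random() < p else 0.0 for _ in range(n_items)]
            for p in true_acc]


def naive_winner_score(scores):
    """Winner-based reporting: best in-sample mean, items reused for selection."""
    return max(sum(row) / len(row) for row in scores)


def repeated_split(scores, n_splits, select_frac, rng):
    """Select on one part of each split, score the winner on the held-out part.

    Returns the averaged held-out score and a per-item contribution
    (mean held-out score of the splitwise winners on that item).
    """
    n = len(scores[0])
    m = int(select_frac * n)
    total, count, split_means = [0.0] * n, [0] * n, []
    for _ in range(n_splits):
        perm = list(range(n))
        rng.shuffle(perm)
        sel, held_out = perm[:m], perm[m:]
        # Hard argmax on the selection part only (assumption: the paper's
        # selection rule is smoother than this).
        winner = max(range(len(scores)),
                     key=lambda k: sum(scores[k][i] for i in sel))
        split_means.append(sum(scores[winner][i] for i in held_out) / len(held_out))
        for i in held_out:
            total[i] += scores[winner][i]
            count[i] += 1
    estimate = sum(split_means) / n_splits
    per_item = [total[i] / count[i] if count[i] else estimate for i in range(n)]
    return estimate, per_item


def simultaneous_half_width(per_item_by_budget, n_boot, level, rng):
    """Gaussian multiplier bootstrap for a sup-type band over the budget grid."""
    n = len(per_item_by_budget[0])
    centered = []
    for psi in per_item_by_budget:
        mean = sum(psi) / n
        centered.append([x - mean for x in psi])
    sup_stats = []
    for _ in range(n_boot):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]  # one multiplier per item
        # Same multipliers across budgets, then take the max over the grid.
        sup_stats.append(max(abs(sum(gi * ci for gi, ci in zip(g, c))) / n
                             for c in centered))
    sup_stats.sort()
    return sup_stats[min(n_boot - 1, math.ceil(level * n_boot) - 1)]


rng = random.Random(0)
true_acc = [0.60] * 8            # toy setup: all candidates equally good
scores = simulate_scores(true_acc, n_items=400, rng=rng)

budgets = [2, 4, 8]              # hypothetical: budget B = first B candidates
results = [repeated_split(scores[:b], n_splits=50, select_frac=0.5, rng=rng)
           for b in budgets]
half_width = simultaneous_half_width([psi for _, psi in results],
                                     n_boot=500, level=0.95, rng=rng)

for b, (est, _) in zip(budgets, results):
    naive = naive_winner_score(scores[:b])
    print(f"B={b}: naive={naive:.3f}  split={est:.3f} +/- {half_width:.3f}")
```

Because every simulated candidate has the same true accuracy, any gap between the naive winner score and 0.60 in this toy run comes from selection alone, and it tends to widen as the budget grows. The split-based estimate is scored on items the selection step never saw, so it does not inherit that gap. Reusing one set of multipliers across all budgets is what makes the band simultaneous over the grid rather than pointwise.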