🤖 AI Summary
This study addresses hidden uncertainty in large language model (LLM) evaluation: rephrasing the prompt, switching the judge model, or changing the temperature can shift scores enough to flip model rankings and reverse conclusions. The work decomposes pipeline uncertainty into its sources, distinguishing variance that shrinks as more data are collected from sensitivity to researcher design choices. It then projects the most efficient path to reducing total error, using variance components estimated from a small sample. Across ideology annotation, safety classification, MMLU benchmarking, and a human-validated propaganda audit, projection-optimized pipelines outperform 73% of possible naive pipelines against a human baseline. On MMLU, optimized budget allocation halves estimation error relative to standard single-prompt evaluation at equivalent cost. The resulting confidence intervals approach nominal coverage when the model includes the relevant pipeline facets, and the same decomposition shows benchmark builders which design choices expose a surface for gaming measurement noise rather than measuring genuine capability.
📝 Abstract
LLM evaluations drive which models get deployed, which safety standards get adopted, and which research conclusions get published. Yet these scores carry hidden uncertainty: rephrasing the prompt, switching the judge model, or changing the temperature can shift results enough to flip rankings and reverse conclusions. Standard confidence intervals ignore this variance, producing under-coverage that worsens with more data. The unmeasured variance also creates an exploitable surface: model developers can optimize against measurement noise rather than genuine capability. This paper decomposes LLM pipeline uncertainty into its sources, distinguishes variance that shrinks with more data from sensitivity to researcher design choices, and projects the most efficient path to reducing total error. For benchmark builders, the same decomposition identifies which design choices contribute exploitable surface for gaming and prescribes designs that minimize it. Across ideology annotation, safety classification, MMLU benchmarking, and a human-validated propaganda audit, projection-optimized pipelines outperform 73% of possible naive pipelines against a human baseline. On MMLU, optimized budget allocation halves estimation error compared to standard single-prompt evaluation at equivalent cost. A small-sample variance estimation exercise is sufficient to derive confidence intervals that approach nominal coverage when the model includes the relevant pipeline facets, and to generate recommendations for reducing measurement error and improving benchmark robustness.
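The distinction between variance that shrinks with more data and sensitivity to design choices can be illustrated with a minimal sketch. It is not the paper's estimator: it assumes a single design facet (prompt variant), an independent sample of items per configuration, and a simple method-of-moments split. The function names, configuration labels, and toy scores are all hypothetical.

```python
import math
import statistics


def decompose(scores_by_config):
    """Split score variance into an item-level and a design-level component.

    scores_by_config maps a hypothetical configuration label (for example a
    prompt variant) to a list of per-item scores. Assumption: each
    configuration was scored on its own independent sample of n items.
    """
    samples = list(scores_by_config.values())
    n = len(samples[0])
    config_means = [statistics.fmean(s) for s in samples]
    grand_mean = statistics.fmean(config_means)
    # Average within-configuration variance of a single item score.
    item_var = statistics.fmean(statistics.variance(s) for s in samples)
    # The spread of configuration means still contains sampling noise
    # (item_var / n); subtract it to isolate the design component.
    observed_between = statistics.variance(config_means)
    design_var = max(0.0, observed_between - item_var / n)
    return grand_mean, item_var, design_var


def projected_variance(item_var, design_var, n_items, k_configs):
    """Variance of the mean score over k configs with n items each.

    Only the second term depends on the total number of model calls
    (n_items * k_configs); the first shrinks only with more configurations.
    """
    return design_var / k_configs + item_var / (n_items * k_configs)


def half_width(variance, z=1.96):
    """Half-width of a normal-approximation 95% interval."""
    return z * math.sqrt(variance)


# Toy data: three prompt variants with accuracies 0.6, 0.7 and 0.8.
scores = {
    "prompt_a": [1] * 1200 + [0] * 800,
    "prompt_b": [1] * 1400 + [0] * 600,
    "prompt_c": [1] * 1600 + [0] * 400,
}
mean, item_var, design_var = decompose(scores)

# Same budget of 2000 model calls, spent two ways.
single_prompt = projected_variance(item_var, design_var, 2000, 1)
ten_prompts = projected_variance(item_var, design_var, 200, 10)

print("naive half-width (sampling only):", half_width(item_var / 2000))
print("half-width, one prompt:", half_width(single_prompt))
print("half-width, ten prompts:", half_width(ten_prompts))
```

In this toy setup the sampling-only interval is far narrower than the one that also accounts for prompt sensitivity, which is the under-coverage pattern the abstract describes. Adding items to a single prompt cannot remove the design term, whereas spreading the same budget over more prompt variants reduces it.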