🤖 AI Summary
This study addresses how to allocate a fixed budget of evaluation queries across heterogeneous LLM judges and prompt-response pairs of varying difficulty so that the resulting score estimates are as accurate as possible. The authors formulate this as budgeted heteroskedastic multi-judge estimation and first analyze the inverse-variance weighted estimator (IVWE), deriving the oracle allocation that minimizes its error rate. For the practical case of unknown variances, they propose EST-IVWE, an adaptive algorithm that uses optimistically biased variance estimates to stabilize the empirical allocation and matches the oracle IVWE rate up to lower-order terms in the budget. Theoretically, they establish a matching local minimax lower bound via an Assouad-type in-expectation argument based on local perturbations, which preserves the local variance structure that Fano-type arguments lose. Experiments on synthetic data and the HelpSteer2 dataset show that EST-IVWE outperforms naïve uniform allocation.
📝 Abstract
Evaluating large language models increasingly relies on LLM-as-a-judge protocols, but such evaluations remain costly: different judges have different prices and reliabilities, and the difficulty of each prompt-response pair can vary substantially. This raises a basic allocation question: under a fixed budget, how should one distribute evaluation queries across heterogeneous judges and instances to obtain the most accurate score estimates? We formalize this question as *budgeted heteroskedastic multi-judge estimation*. Given $K$ prompt-response pairs, $J$ judges with known costs, and unknown query-judge variances, the goal is to estimate a bounded score vector while minimizing an $\ell_p$-error. Our first contribution is to analyze the inverse-variance weighted estimator (IVWE) and to derive the oracle allocation that minimizes its error rate. Since this allocation depends on the unknown variances, we then address the practical unknown-variance setting by proposing EST-IVWE, an adaptive algorithm that constructs and leverages *optimistically biased* variance estimates to stabilize the empirical allocation. We prove that EST-IVWE matches the oracle IVWE rate up to lower-order terms in the budget. Our second and central theoretical contribution is a matching *local* minimax lower bound, which establishes the instance-optimality of the proposed algorithms. A key technical insight is that Fano-type high-probability arguments are too coarse for this problem: their packing construction loses the local variance structure that governs the optimal allocation. We instead use an Assouad-type in-expectation argument, based on local perturbations, which preserves this structure and yields the sharp allocation-dependent lower bound. Finally, we numerically validate the superiority of our approach over naïve uniform allocation on synthetic and HelpSteer2 datasets.
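To make the inverse-variance weighting idea concrete, the following is a minimal sketch of the textbook inverse-variance weighted estimate for a single prompt-response pair. It is not the paper's EST-IVWE algorithm and does not compute the oracle allocation. It assumes each judge returns independent, unbiased noisy scores with a known positive variance, and the function name `ivwe` and its arguments are hypothetical names chosen for illustration.

```python
import numpy as np


def ivwe(judge_means, judge_vars, judge_counts):
    """Combine per-judge sample means into one inverse-variance weighted estimate.

    Assumptions (illustrative, not taken from the paper): judge j was queried
    judge_counts[j] times, its noise is independent and unbiased, and its
    per-query variance judge_vars[j] is known and strictly positive.
    """
    means = np.asarray(judge_means, dtype=float)
    variances = np.asarray(judge_vars, dtype=float)
    counts = np.asarray(judge_counts, dtype=float)

    # Precision contributed by each judge: number of queries / per-query variance.
    precision = counts / variances
    total_precision = precision.sum()
    if total_precision <= 0:
        raise ValueError("at least one judge must be queried")

    # Weighted average of the judge means, with weights proportional to precision.
    estimate = float((precision * means).sum() / total_precision)
    # Variance of the combined estimate under the independence assumption.
    variance = float(1.0 / total_precision)
    return estimate, variance


# Two judges queried 4 times each; the second judge is four times noisier.
estimate, variance = ivwe([0.6, 0.8], [0.04, 0.16], [4, 4])
print(estimate, variance)
```

In this toy case the combined estimate is pulled toward the more reliable judge (about 0.64 rather than the plain average 0.7), and its variance (about 0.008) is lower than the 0.0125 obtained by averaging all eight scores with equal weight. Deciding how many queries each judge and each pair should receive under a cost budget is the allocation question the paper studies.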