Instance-Optimal Estimation with Multiple LLM Judges on a Budget

📅 2026-05-22

🤖 AI Summary

This study addresses how to allocate a fixed budget of evaluation queries across heterogeneous LLM judges and prompt-response pairs of varying difficulty so that the resulting score estimates are as accurate as possible. The authors formulate this as budgeted heteroskedastic multi-judge estimation, analyze the inverse-variance weighted estimator (IVWE) together with its oracle allocation, and propose EST-IVWE, an adaptive algorithm that uses optimistically biased variance estimates to stabilize the empirical allocation when the variances are unknown. EST-IVWE is shown to match the oracle IVWE rate up to lower-order terms in the budget. Theoretically, they establish a matching local minimax lower bound via an Assouad-type in-expectation argument based on local perturbations, which preserves the local variance structure and yields a sharp allocation-dependent bound. Experiments on synthetic data and the HelpSteer2 dataset show that EST-IVWE outperforms naïve uniform allocation.

📝 Abstract

Evaluating large language models increasingly relies on LLM-as-a-judge protocols, but such evaluations remain costly: different judges have different prices and reliabilities, and the difficulty of each prompt-response pair can vary substantially. This raises a basic allocation question: under a fixed budget, how should one distribute evaluation queries across heterogeneous judges and instances to obtain the most accurate score estimates? We formalize this question as *budgeted heteroskedastic multi-judge estimation*. Given $K$ prompt-response pairs, $J$ judges with known costs, and unknown query-judge variances, the goal is to estimate a bounded score vector while minimizing an $\ell_p$-error. Our first contribution is to analyze the inverse-variance weighted estimator (IVWE) and to derive the oracle allocation that minimizes its error rate. Since this allocation depends on the unknown variances, we then address the practical unknown-variance setting by proposing EST-IVWE, an adaptive algorithm that constructs and leverages *optimistically biased* variance estimates to stabilize the empirical allocation. We prove that EST-IVWE matches the oracle IVWE rate up to lower-order terms in the budget. Our second and central theoretical contribution is a matching *local* minimax lower bound, which establishes the instance-optimality of the proposed algorithms. A key technical insight is that Fano-type high-probability arguments are too coarse for this problem: their packing construction loses the local variance structure that governs the optimal allocation. We instead use an Assouad-type in-expectation argument, based on local perturbations, which preserves this structure and yields the sharp allocation-dependent lower bound. Finally, we numerically validate the superiority of our approach over naïve uniform allocation on synthetic and HelpSteer2 datasets.
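
To make the inverse-variance weighted estimator concrete, here is a minimal sketch of textbook inverse-variance weighting for a single prompt-response pair, assuming the per-judge variances are known. The function name `ivwe` and its arguments are hypothetical and chosen for illustration; this is not the paper's EST-IVWE algorithm, and it does not reproduce the oracle allocation or the optimistic variance estimates described above.

```python
import numpy as np


def ivwe(means, variances, counts):
    """Combine per-judge sample means for one item by inverse-variance weighting.

    means[j]     : average score returned by judge j (hypothetical input)
    variances[j] : noise variance of judge j on this item, assumed known here
    counts[j]    : number of queries sent to judge j
    Returns the combined estimate and its variance.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    counts = np.asarray(counts, dtype=float)

    # Judge j contributes precision n_j / sigma_j^2 to the combined estimate.
    precision = counts / variances
    total = precision.sum()
    if total <= 0:
        raise ValueError("at least one judge must be queried")

    estimate = float((precision * means).sum() / total)
    # The variance of the weighted combination is the inverse total precision.
    return estimate, float(1.0 / total)


# Two judges queried four times each; the first is four times less noisy.
est, var = ivwe(means=[0.6, 0.8], variances=[0.04, 0.16], counts=[4, 4])
print(est, var)  # approximately 0.64 and 0.008
```

Because the combined variance is the inverse of the summed precisions, how many queries each judge receives directly controls the error, which is why the allocation question matters once judges also differ in cost.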

Problem

Research questions and friction points this paper is trying to address.

budgeted estimation

heteroskedasticity

multi-judge evaluation

instance-optimal

LLM-as-a-judge

Innovation

Methods, ideas, or system contributions that make the work stand out.

instance-optimality

budgeted estimation

heteroskedastic multi-judge