LLM-as-Judge on a Budget

📅 2026-02-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the problem of optimally allocating a fixed query budget to minimize estimation error in large language model (LLM)-as-Judge scoring. The authors propose a variance-adaptive sampling method that dynamically allocates query resources to prompt-response pairs with the highest uncertainty, leveraging multi-armed bandit techniques and concentration inequalities. As the first study to introduce a variance-aware dynamic budget allocation mechanism into the LLM-as-Judge evaluation framework, this approach achieves near-optimal theoretical error bounds. Empirical results on the Summarize-From-Feedback and HelpSteer2 datasets demonstrate significant improvements over uniform allocation strategies, effectively reducing worst-case estimation error under the same computational budget.
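The stated near-optimal rate can be sanity-checked with a standard sub-Gaussian argument (a sketch, not taken from the paper): with $n_i$ queries to pair $i$, the standard error of the mean score is $\sigma_i/\sqrt{n_i}$, so equalizing errors across pairs under the budget constraint $\sum_i n_i = B$ suggests the allocation

$$n_i \;=\; B\,\frac{\sigma_i^2}{\sum_{j=1}^K \sigma_j^2} \quad\Longrightarrow\quad \frac{\sigma_i}{\sqrt{n_i}} \;=\; \sqrt{\frac{\sum_{j=1}^K \sigma_j^2}{B}} \quad \text{for every } i,$$

which matches the worst-case bound quoted in the abstract up to logarithmic factors; the paper's contribution is achieving this allocation adaptively, without knowing the $\sigma_i^2$ in advance.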

📝 Abstract
LLM-as-a-judge has emerged as a cornerstone technique for evaluating large language models by leveraging LLM reasoning to score prompt-response pairs. Since LLM judgments are stochastic, practitioners commonly query each pair multiple times to estimate mean scores accurately. This raises a critical challenge: given a fixed computational budget $B$, how should queries be allocated across $K$ prompt-response pairs to minimize estimation error? We present a principled variance-adaptive approach leveraging multi-armed bandit theory and concentration inequalities. Our method dynamically allocates queries based on estimated score variances, concentrating resources where uncertainty is highest. Further, our algorithm is shown to achieve a worst-case score-estimation error of $\tilde{O}\left(\sqrt{\frac{\sum_{i=1}^K \sigma_i^2}{B}}\right)$, where $\sigma_i^2$ is the unknown score variance for pair $i \in [K]$, with near-optimal budget allocation. Experiments on \emph{Summarize-From-Feedback} and \emph{HelpSteer2} demonstrate that our method significantly outperforms uniform allocation, reducing worst-case estimation error while maintaining identical budgets. Our work establishes a theoretical foundation for efficient LLM evaluation with practical implications for AI safety, model alignment, and automated assessment at scale.
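The allocation idea described in the abstract can be sketched in a few lines: after a short warm-up, always spend the next query on the pair whose empirical standard error is currently largest, which drives the sample counts toward $n_i \propto \sigma_i^2$. This is a minimal illustration under assumed interfaces (the `judges` list of score-sampling callables is hypothetical), not the paper's actual algorithm, which additionally uses concentration inequalities on the variance estimates.

```python
import random
import statistics

def variance_adaptive_allocate(judges, budget, warmup=2):
    """Spend `budget` judge queries across K prompt-response pairs.

    `judges` is a hypothetical interface: judges[i]() returns one
    stochastic judge score for pair i. Each round queries the pair with
    the largest squared standard error of its mean, var_i / n_i, in the
    spirit of variance-adaptive allocation.
    Returns (estimated mean scores, queries spent per pair).
    """
    K = len(judges)
    # Warm-up: a few queries per pair to seed variance estimates.
    scores = [[judges[i]() for _ in range(warmup)] for i in range(K)]
    spent = K * warmup
    while spent < budget:
        # Squared standard error of the mean for each pair.
        sem2 = [statistics.variance(s) / len(s) for s in scores]
        i = max(range(K), key=sem2.__getitem__)
        scores[i].append(judges[i]())
        spent += 1
    return [statistics.mean(s) for s in scores], [len(s) for s in scores]
```

With a zero-variance pair and a noisy pair, the noisy pair absorbs the whole remaining budget, exactly the behavior uniform allocation cannot provide.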
Problem

Research questions and friction points this paper is trying to address.

LLM-as-Judge
computational budget
estimation error
query allocation
variance
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-as-Judge
variance-adaptive allocation
multi-armed bandit
budget-constrained evaluation
concentration inequalities