🤖 AI Summary
This paper addresses the distortion that system-level biases (i.e., consistent positive or negative preferences toward specific systems) introduce into LLM-as-a-judge evaluations of system-level ranking tasks. To tackle this, we propose the first systematic, judge-oriented evaluation framework specifically designed for LLM-based system ranking. Methodologically, we aggregate judgment scores over multiple system outputs to produce robust system scores and enable reproducible automatic ranking; quantify judge bias and decisiveness against high-quality human rankings as ground truth; and conduct a large-scale empirical evaluation. Key contributions include: (1) releasing the first benchmark dedicated to evaluating LLM judges on system-ranking tasks; (2) empirically revealing a strong correlation between judge performance and system-level bias; and (3) establishing a reliable, interpretable paradigm for automated LLM evaluation, providing both theoretical foundations and practical tools for trustworthy LLM assessment.
📝 Abstract
Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach first requires validating the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses or response pairs while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge's quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.
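The evaluation loop the abstract describes can be sketched in a few lines: average a judge's per-response scores into system scores, rank the systems, and compare that ranking to a human-based one with a rank correlation such as Kendall's tau. This is a minimal illustrative sketch, not the paper's actual pipeline; the system names, scores, and the choice of mean aggregation and Kendall's tau are all assumptions for demonstration.

```python
# Hypothetical sketch of judge-vs-human system ranking (illustrative data only).
judge_scores = {
    "system_A": [0.9, 0.8, 0.85, 0.9],
    "system_B": [0.6, 0.7, 0.65, 0.6],
    "system_C": [0.8, 0.75, 0.7, 0.8],
}

def system_ranking(scores):
    """Aggregate per-response judge scores into a mean system score,
    then rank systems from best to worst."""
    means = {s: sum(v) / len(v) for s, v in scores.items()}
    return sorted(means, key=means.get, reverse=True)

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two rankings of the same items
    (no ties): (concordant - discordant) / total pairs."""
    pos_a = {s: i for i, s in enumerate(rank_a)}
    pos_b = {s: i for i, s in enumerate(rank_b)}
    items = list(rank_a)
    concordant = discordant = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            x, y = items[i], items[j]
            # A pair is concordant if both rankings order it the same way.
            if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
                concordant += 1
            else:
                discordant += 1
    n_pairs = len(items) * (len(items) - 1) / 2
    return (concordant - discordant) / n_pairs

judge_rank = system_ranking(judge_scores)
human_rank = ["system_A", "system_C", "system_B"]  # assumed human ground truth
print(kendall_tau(judge_rank, human_rank))  # 1.0 here -> perfect agreement
```

A judge with a strong positive bias toward one system would inflate that system's mean score, shuffle the judge-based ranking, and show up directly as a lower correlation with the human ranking.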