🤖 AI Summary
Addressing the challenge of credibly validating LLM-as-a-judge systems when human-annotated gold labels are unavailable, this paper introduces a framework for gold-standard-free validation. It shows that conventional aggregation-based validation, which collapses multiple human ratings into a single per-item gold label, can select highly suboptimal judge systems because it discards principled disagreement among human raters. Methodologically, the work pairs a theoretical analysis connecting different measures of judge system performance under different rating elicitation and aggregation schemes with an empirical comparison of candidate judge systems. The core finding: judge systems selected by existing gold-label approaches can perform as much as 34% worse than those selected by the alternative approaches the authors describe. The paper closes with concrete recommendations for more reliable LLM-as-a-judge validation, offering both theoretical grounding and practical guidance for trustworthy GenAI evaluation.
📝 Abstract
The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, has come to play a critical role in scaling and standardizing GenAI evaluations. To validate judge systems, evaluators collect multiple human ratings for each item in a validation corpus, and then aggregate the ratings into a single, per-item gold label rating. High agreement rates between these gold labels and judge system ratings are then taken as a sign of good judge system performance. In many cases, however, items or rating criteria may be ambiguous, or there may be principled disagreement among human raters. In such settings, gold labels may not exist for many of the items. In this paper, we introduce a framework for LLM-as-a-judge validation in the absence of gold labels. We present a theoretical analysis drawing connections between different measures of judge system performance under different rating elicitation and aggregation schemes. We also demonstrate empirically that existing validation approaches can select judge systems that are highly suboptimal, performing as much as 34% worse than the systems selected by alternative approaches that we describe. Based on our findings, we provide concrete recommendations for developing more reliable approaches to LLM-as-a-judge validation.
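To make the contrast concrete, below is a minimal, illustrative Python sketch of the two validation styles the abstract distinguishes: conventional agreement against majority-vote gold labels versus an agreement measure that retains the full distribution of human ratings. The function names, the binary rating setup, and the tie-breaking and scoring choices are all assumptions for illustration, not the paper's actual method.

```python
from collections import Counter

def majority_gold_label(ratings):
    """Aggregate multiple human ratings for one item into a single gold
    label by majority vote (ties broken by first-seen order in Counter)."""
    return Counter(ratings).most_common(1)[0][0]

def gold_label_agreement(judge_ratings, human_ratings_per_item):
    """Conventional validation: the fraction of items where the judge's
    rating matches the majority-vote gold label."""
    matches = sum(
        judge == majority_gold_label(humans)
        for judge, humans in zip(judge_ratings, human_ratings_per_item)
    )
    return matches / len(judge_ratings)

def distributional_agreement(judge_ratings, human_ratings_per_item):
    """An alternative that avoids collapsing ratings into a gold label:
    the judge's rating is scored against every individual human rating,
    so principled disagreement among raters is preserved."""
    total = 0.0
    for judge, humans in zip(judge_ratings, human_ratings_per_item):
        total += sum(judge == h for h in humans) / len(humans)
    return total / len(judge_ratings)

# Toy corpus: the third item is ambiguous (human votes split 2-2),
# so the two measures weight it very differently.
humans = [[1, 1, 0], [0, 0, 1], [1, 0, 1, 0]]
judge = [1, 0, 1]
print(gold_label_agreement(judge, humans))       # 1.0
print(distributional_agreement(judge, humans))   # ~0.61
```

On ambiguous items the gold-label measure credits full agreement even when half the human raters disagree with the judge, which is one way two candidate judge systems can be ranked differently depending on which measure an evaluator trusts.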