Who Judges the Judge? LLM Jury-on-Demand: Building Trustworthy LLM Evaluation Systems

πŸ“… 2025-12-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the limited reliability and poor scalability of LLM-based evaluation in high-stakes applications, this paper proposes LLM Jury-on-Demand, a dynamic, learning-driven evaluation framework. Its core innovation is a set of reliability predictors that jointly model token-level distributions, embedding representations, and structured input features to capture fine-grained agreement between individual jury models and human judgments. Based on these predictions, the framework dynamically selects an optimal subset of judge models for each data point and aggregates their outputs with reliability-based weights. Unlike static ensembles or single-model evaluators, it enables context-aware, on-demand, and adaptive assessment. Empirical evaluation on summarization and retrieval-augmented generation (RAG) benchmarks demonstrates significantly higher correlation with human judgments than both single-model and fixed-jury baselines, validating the effectiveness and generalizability of the dynamic jury mechanism.
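The selection-and-weighting step admits a compact formulation. The notation below is my own, a plausible reading of the summary rather than the paper's exact math: for input x, let r_j(x) be judge j's predicted reliability, let J(x) be the k judges with the highest r_j(x), and let s_j(x) be judge j's score.

```latex
% Reliability-weighted jury aggregation (assumed notation, not taken from the paper).
% J(x): the k judges with the highest predicted reliability r_j(x) on input x.
\hat{s}(x) = \sum_{j \in J(x)} w_j(x)\, s_j(x),
\qquad
w_j(x) = \frac{r_j(x)}{\sum_{j' \in J(x)} r_{j'}(x)}
```

Normalizing the weights to sum to one keeps the aggregate score on the same scale as the individual judges' scores.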

πŸ“ Abstract
As Large Language Models (LLMs) become integrated into high-stakes domains, there is a growing need for evaluation methods that are both scalable for real-time deployment and reliable for critical decision-making. Human evaluation is reliable but slow and costly; single LLM judges are biased, and static juries lack adaptability. To overcome these limitations, we propose LLM Jury-on-Demand: a dynamic, learning-based framework for scalable, context-aware evaluation. Our method trains a set of reliability predictors to assess when LLM judges will agree with human experts, leveraging token distributions, embeddings, and structural input features. This enables a fully adaptive evaluation in which, for each data point, an optimal jury of the most reliable judges is dynamically selected and their scores are aggregated using their reliability as weights. Experiments on summarization and RAG benchmarks show that our dynamic jury system achieves significantly higher correlation with human judgment than both single-judge and static-jury baselines. These results highlight the promise of adaptive, learning-based juries for building scalable, reliable, and trustworthy evaluation systems for modern LLMs in high-stakes domains.
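To make the per-example mechanism concrete, here is a minimal sketch of jury selection and weighted aggregation. Everything below is an assumption for illustration: the function name, the `predict_proba` interface on the reliability predictors, and the jury size `k` are mine, not the paper's implementation.

```python
import numpy as np

def evaluate_with_dynamic_jury(example_features, judge_scores, predictors, k=3):
    """Score one example with a dynamically selected, reliability-weighted jury.

    example_features : 1-D numpy array of features for this example
                       (token-distribution, embedding, and structural features).
    judge_scores     : 1-D numpy array of raw quality scores, one per judge.
    predictors       : fitted per-judge reliability models exposing
                       predict_proba (hypothetical interface).
    k                : jury size (hypothetical default).
    """
    # Predicted probability, per judge, of agreeing with human experts here.
    reliability = np.array([
        p.predict_proba(example_features.reshape(1, -1))[0, 1] for p in predictors
    ])
    # Dynamically select the k judges predicted to be most reliable.
    jury = np.argsort(reliability)[-k:]
    # Weight each selected judge's score by its normalized reliability.
    weights = reliability[jury] / reliability[jury].sum()
    return float(np.dot(weights, judge_scores[jury]))
```

Because selection happens per data point, the jury composition can change from one example to the next, which is what distinguishes this scheme from a static ensemble.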
Problem

Research questions and friction points this paper is trying to address.

Develops a dynamic jury system for scalable LLM evaluation
Addresses bias and adaptability issues in LLM judgment methods
Enhances reliability of LLM assessments in high-stakes applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic jury selection based on reliability predictors
Training reliability predictors on token distributions, embeddings, and structural input features (training sketched after this list)
Weighted aggregation of scores for adaptive evaluation
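A hedged sketch of how such a predictor might be trained, one per judge. The agreement criterion (absolute score difference within a tolerance) and the choice of logistic regression are my assumptions; the paper specifies only the feature families.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_reliability_predictor(features, judge_scores, human_scores, tol=1.0):
    """Fit a reliability predictor for one judge.

    features     : (n_examples, n_features) matrix combining token-distribution
                   statistics, embeddings, and structural input features.
    judge_scores : this judge's scores on the n_examples training items.
    human_scores : expert scores on the same items.
    tol          : agreement tolerance (hypothetical criterion, not the paper's).
    """
    # Binary target: did the judge agree with the human experts on this item?
    agrees = (np.abs(judge_scores - human_scores) <= tol).astype(int)
    # Any probabilistic classifier would do; logistic regression is one simple choice.
    return LogisticRegression(max_iter=1000).fit(features, agrees)
```

The fitted models plug directly into the selection sketch above via `predict_proba`.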
πŸ‘₯ Authors
Xiaochuan Li, Carnegie Mellon University (Machine Learning, Natural Language Processing)
Ke Wang, Wells Fargo Bank, N.A., USA
Girija Gouda, Wells Fargo Bank, N.A., USA
Shubham Choudhary, Wells Fargo Bank, N.A., USA
Yaqun Wang, Wells Fargo Bank, N.A., USA
Linwei Hu, Wells Fargo Bank, N.A., USA
Joel Vaughan, Wells Fargo Bank, N.A., USA
Freddy Lecue, Frontier AI Model Head @ Wells Fargo, New York, USA; ex-JPMorgan, Thales, Accenture, IBM, Orange (Frontier Artificial Intelligence Model, Knowledge Representation and Reasoning)