Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments

📅 2025-04-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address two key challenges in LLM response evaluation (biases and mistakes in human judgment that alignment-focused work tends to overlook, and the difficulty of selecting suitable judgments from multiple candidate LLM responses), this paper proposes a three-stage meta-judge pipeline. First, GPT-4 and human experts collaboratively construct a fine-grained, interpretable scoring rubric. Second, three advanced LLM agents collaboratively score each candidate judgment against the rubric. Third, a threshold-based filter removes low-scoring judgments. On the JudgeBench dataset, the pipeline achieves roughly a 15.55% improvement over raw judgments and an 8.37% gain over a single-agent baseline, demonstrating the potential of LLMs as meta-judges rather than relying on a single model acting as both judge and meta-judge.

📝 Abstract
Large language models (LLMs) are being widely applied across various fields, but as tasks become more complex, evaluating their responses is increasingly challenging. Compared to human evaluators, using LLMs to support performance evaluation offers a more efficient alternative. However, most studies focus mainly on aligning LLMs' judgments with human preferences, overlooking the existence of biases and mistakes in human judgment. Furthermore, how to select suitable LLM judgments given multiple potential LLM responses remains underexplored. To address these two issues, we propose a three-stage meta-judge selection pipeline: 1) developing a comprehensive rubric with GPT-4 and human experts, 2) using three advanced LLM agents to score judgments, and 3) applying a threshold to filter out low-scoring judgments. Compared to methods using a single LLM as both judge and meta-judge, our pipeline introduces multi-agent collaboration and a more comprehensive rubric. Experimental results on the JudgeBench dataset show an improvement of about 15.55% over raw judgments and about 8.37% over the single-agent baseline. Our work demonstrates the potential of LLMs as meta-judges and lays the foundation for future research on constructing preference datasets for LLM-as-a-judge reinforcement learning.
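The three-stage pipeline described in the abstract can be pictured with the minimal sketch below. All concrete names (the rubric criteria, the agent identifiers, the 10-point scale, and the 7.0 threshold) are illustrative assumptions for exposition, not the paper's actual prompts, models, or settings.

```python
from statistics import mean

# Stage 1: a fine-grained rubric, here reduced to a few named criteria
# (in the paper it is developed jointly by GPT-4 and human experts).
RUBRIC = ["factual_accuracy", "logical_consistency", "completeness", "clarity"]

# Stage 2: three advanced LLM agents score each judgment.
AGENTS = ["agent_a", "agent_b", "agent_c"]  # placeholders for three different LLMs

def score_judgment(agent: str, judgment: str, rubric: list) -> float:
    """Return one agent's mean rubric score for a judgment (1-10 scale assumed).

    A real implementation would prompt the agent with the rubric and the
    judgment text; this stub just returns a fixed stand-in value.
    """
    per_criterion = {criterion: 7.0 for criterion in rubric}  # stand-in ratings
    return mean(per_criterion.values())

def meta_judge(judgment: str, threshold: float = 7.0) -> bool:
    """Stage 3: keep a judgment only if the agents' aggregated score clears the threshold."""
    scores = [score_judgment(agent, judgment, RUBRIC) for agent in AGENTS]
    return mean(scores) >= threshold

# Usage: filter a pool of candidate judgments down to the high-scoring subset.
candidate_judgments = [
    "Response 1 is preferred because it answers the question directly ...",
    "Response 2 is preferred because it is longer ...",
]
selected = [j for j in candidate_judgments if meta_judge(j)]
print(f"kept {len(selected)} of {len(candidate_judgments)} judgments")
```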
Problem

Research questions and friction points this paper is trying to address.

Evaluating complex LLM responses is challenging
Human judgment biases are often overlooked
Selecting suitable LLM judgments remains underexplored
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent LLM collaboration for judgment evaluation
GPT-4 and human expert rubric development
Threshold filtering of low-scoring judgments (see the sketch after this list)
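A hedged sketch of how the threshold-filtering stage could be evaluated on labeled data such as JudgeBench: compare the accuracy of all raw judgments with the accuracy of the subset the meta-judge keeps. The field names, toy scores, and the 7.0 cutoff are assumptions for illustration, not the paper's actual setup.

```python
# Toy judgments: each records which response the judge chose, the ground-truth
# label, and the meta-judge score assigned by the multi-agent stage.
toy_judgments = [
    {"chosen": "A", "label": "A", "meta_score": 8.2},  # correct, kept by the filter
    {"chosen": "B", "label": "A", "meta_score": 5.1},  # wrong, filtered out
    {"chosen": "B", "label": "B", "meta_score": 7.4},  # correct, kept
    {"chosen": "A", "label": "B", "meta_score": 6.0},  # wrong, filtered out
]

def accuracy(judgments):
    """Fraction of judgments whose chosen response matches the ground truth."""
    return sum(j["chosen"] == j["label"] for j in judgments) / len(judgments)

kept = [j for j in toy_judgments if j["meta_score"] >= 7.0]  # threshold filter

print(f"raw accuracy:      {accuracy(toy_judgments):.2f}")  # 0.50 on the toy data
print(f"filtered accuracy: {accuracy(kept):.2f}")            # 1.00 on the toy data
```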
👥 Authors
Yuran Li
Intelligent Automation Lab, McGill University
Jama Hussein Mohamud
Mila, Quebec AI Institute
Chongren Sun
Intelligent Automation Lab, McGill University
Di Wu
Intelligent Automation Lab, McGill University
Benoit Boulet
Professor of Electrical & Computer Engineering, McGill University
Research interests: Robust Control, Electric Vehicles, Biomedical Systems, Renewable Energy