Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments

📅 2025-04-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address two key challenges in LLM response evaluation (biases and mistakes in human judgment that alignment-focused work tends to overlook, and the difficulty of selecting suitable judgments from multiple candidate LLM responses), this paper proposes a three-stage meta-judge pipeline. First, GPT-4 and human experts collaboratively construct a fine-grained, interpretable scoring rubric. Second, three advanced LLM agents collaboratively score each candidate judgment against the rubric. Third, a threshold-based filter removes low-scoring judgments. On the JudgeBench dataset, the pipeline achieves roughly a 15.55% improvement over raw judgments and an 8.37% gain over a single-agent baseline, demonstrating the potential of LLMs as meta-judges rather than relying on a single model acting as both judge and meta-judge.

📝 Abstract
Large language models (LLMs) are being widely applied across various fields, but as tasks become more complex, evaluating their responses is increasingly challenging. Compared to human evaluators, using LLMs to support performance evaluation offers a more efficient alternative. However, most studies focus mainly on aligning LLMs' judgments with human preferences, overlooking the existence of biases and mistakes in human judgment. Furthermore, how to select suitable LLM judgments given multiple potential LLM responses remains underexplored. To address these two issues, we propose a three-stage meta-judge selection pipeline: 1) developing a comprehensive rubric with GPT-4 and human experts, 2) using three advanced LLM agents to score judgments, and 3) applying a threshold to filter out low-scoring judgments. Compared to methods using a single LLM as both judge and meta-judge, our pipeline introduces multi-agent collaboration and a more comprehensive rubric. Experimental results on the JudgeBench dataset show an improvement of about 15.55% over raw judgments and about 8.37% over the single-agent baseline. Our work demonstrates the potential of LLMs as meta-judges and lays the foundation for future research on constructing preference datasets for LLM-as-a-judge reinforcement learning.
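The three-stage pipeline described in the abstract can be pictured with the minimal sketch below. All concrete names (the rubric criteria, the agent identifiers, the 10-point scale, and the 7.0 threshold) are illustrative assumptions for exposition, not the paper's actual prompts, models, or settings.

```python
from statistics import mean

# Stage 1: a fine-grained rubric, here reduced to a few named criteria
# (in the paper it is developed jointly by GPT-4 and human experts).
RUBRIC = ["factual_accuracy", "logical_consistency", "completeness", "clarity"]

# Stage 2: three advanced LLM agents score each judgment.
AGENTS = ["agent_a", "agent_b", "agent_c"]  # placeholders for three different LLMs

def score_judgment(agent: str, judgment: str, rubric: list) -> float:
    """Return one agent's mean rubric score for a judgment (1-10 scale assumed).

    A real implementation would prompt the agent with the rubric and the
    judgment text; this stub just returns a fixed stand-in value.
    """
    per_criterion = {criterion: 7.0 for criterion in rubric}  # stand-in ratings
    return mean(per_criterion.values())

def meta_judge(judgment: str, threshold: float = 7.0) -> bool:
    """Stage 3: keep a judgment only if the agents' aggregated score clears the threshold."""
    scores = [score_judgment(agent, judgment, RUBRIC) for agent in AGENTS]
    return mean(scores) >= threshold

# Usage: filter a pool of candidate judgments down to the high-scoring subset.
candidate_judgments = [
    "Response 1 is preferred because it answers the question directly ...",
    "Response 2 is preferred because it is longer ...",
]
selected = [j for j in candidate_judgments if meta_judge(j)]
print(f"kept {len(selected)} of {len(candidate_judgments)} judgments")
```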
Problem

Research questions and friction points this paper is trying to address.

Evaluating complex LLM responses is challenging
Human judgment biases are often overlooked
Selecting suitable LLM judgments remains underexplored
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent LLM collaboration for judgment evaluation
GPT-4 and human expert rubric development
Threshold filtering of low-scoring judgments (see the sketch after this list)
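A hedged sketch of how the threshold-filtering stage could be evaluated on labeled data such as JudgeBench: compare the accuracy of all raw judgments with the accuracy of the subset the meta-judge keeps. The field names, toy scores, and the 7.0 cutoff are assumptions for illustration, not the paper's actual setup.

```python
# Toy judgments: each records which response the judge chose, the ground-truth
# label, and the meta-judge score assigned by the multi-agent stage.
toy_judgments = [
    {"chosen": "A", "label": "A", "meta_score": 8.2},  # correct, kept by the filter
    {"chosen": "B", "label": "A", "meta_score": 5.1},  # wrong, filtered out
    {"chosen": "B", "label": "B", "meta_score": 7.4},  # correct, kept
    {"chosen": "A", "label": "B", "meta_score": 6.0},  # wrong, filtered out
]

def accuracy(judgments):
    """Fraction of judgments whose chosen response matches the ground truth."""
    return sum(j["chosen"] == j["label"] for j in judgments) / len(judgments)

kept = [j for j in toy_judgments if j["meta_score"] >= 7.0]  # threshold filter

print(f"raw accuracy:      {accuracy(toy_judgments):.2f}")  # 0.50 on the toy data
print(f"filtered accuracy: {accuracy(kept):.2f}")            # 1.00 on the toy data
```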
👥 Authors
Yuran Li
Intelligent Automation Lab, McGill University
Jama Hussein Mohamud
Mila, Quebec AI Institute
Chongren Sun
Intelligent Automation Lab, McGill University
Di Wu
Intelligent Automation Lab, McGill University
Benoit Boulet
Professor of Electrical & Computer Engineering, McGill University
Research interests: Robust Control, Electric Vehicles, Biomedical Systems, Renewable Energy