Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models

📅 2025-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
LLM-based judges are highly sensitive to evaluation prompts, producing inconsistent judgments of RAG model outputs and hindering reliable automatic evaluation. To address this, the authors propose Judge-Consistency (ConsJudge), which prompts LLMs to generate judgments under different combinations of judgment dimensions and uses cross-prompt judge-consistency as the supervisory signal to select accepted and rejected judgments for DPO training. Experiments across multiple RAG models and benchmark datasets show that ConsJudge provides more accurate judgments for optimizing RAG models, and its judgments agree closely with those of a superior LLM judge. The implementation is publicly available.

📝 Abstract
Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). However, existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. LLM-based judgment models provide the potential to produce high-quality judgments, but they are highly sensitive to evaluation prompts, leading to inconsistencies when judging the output of RAG models. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models. Specifically, ConsJudge prompts LLMs to generate different judgments based on various combinations of judgment dimensions, utilizes judge-consistency to evaluate these judgments, and selects the accepted and rejected judgments for DPO training. Our experiments show that ConsJudge can effectively provide more accurate judgments for optimizing RAG models across various RAG models and datasets. Further analysis reveals that judgments generated by ConsJudge have a high agreement with the superior LLM. All code is available at https://github.com/OpenBMB/ConsJudge.
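The abstract describes a selection step: judge the same RAG output under several judgment-dimension combinations, score the resulting judgments by their mutual consistency, and keep the most and least consistent ones as the accepted/rejected pair for DPO. A minimal sketch of that idea, assuming a simple majority-vote notion of consistency (the function and data names here are illustrative, not the authors' implementation):

```python
# Hypothetical sketch of judge-consistency-based pair selection.
# Each judgment is (dimension_combo, verdict); the verdict most
# consistent with the majority becomes "accepted", a dissenting
# one becomes "rejected" for DPO preference training.
from collections import Counter

def select_dpo_pair(judgments):
    """Pick (accepted, rejected) judgments by majority consistency."""
    verdicts = [v for _, v in judgments]
    majority, _ = Counter(verdicts).most_common(1)[0]
    accepted = next(j for j in judgments if j[1] == majority)
    rejected = next((j for j in judgments if j[1] != majority), None)
    return accepted, rejected

# Example: four prompts with different dimension combinations,
# three agree on answer "A", one prefers "B".
pairs = [("relevance+faithfulness", "A"),
         ("fluency", "B"),
         ("relevance", "A"),
         ("faithfulness+fluency", "A")]
acc, rej = select_dpo_pair(pairs)
# acc -> ("relevance+faithfulness", "A"); rej -> ("fluency", "B")
```

The paper's actual consistency measure operates on fine-grained judgments rather than a simple vote; this sketch only illustrates the accepted/rejected selection pattern feeding DPO.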
Problem

Research questions and friction points this paper is trying to address.

Improving RAG model evaluation accuracy
Reducing LLM judgment inconsistencies
Enhancing LLM-based evaluation prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

ConsJudge enhances LLM evaluations
Utilizes judge-consistency for DPO training
Improves RAG model accuracy
Shuliang Liu
PhD, HKUST(GZ)
Trustworthy LLM, VLM, Recommendation System
Xinze Li
Department of Computer Science and Technology, Northeastern University, China
Zhenghao Liu
Northeastern University
NLP, Information Retrieval
Yukun Yan
Tsinghua University
Large Language Model
Cheng Yang
School of Computer Science, Beijing University of Posts and Telecommunications
Zheni Zeng
Nanjing University, Tsinghua University
AI for Science, Large Language Model
Zhiyuan Liu
Department of Computer Science and Technology, Institute for AI, Tsinghua University, China, Beijing National Research Center for Information Science and Technology, China
Maosong Sun
Professor of Computer Science and Technology, Tsinghua University
Natural Language Processing, Artificial Intelligence, Social Computing
Ge Yu
Department of Computer Science and Technology, Northeastern University, China