Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges

📅 2025-04-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of assessing "support" in Retrieval-Augmented Generation (RAG) systems: whether the information in cited documents actually supports the generated answer. The authors conduct a large-scale comparative study of 45 participant submissions on 36 topics from the TREC 2024 RAG Track, comparing an automatic LLM judge (GPT-4o) against human judges. Two conditions are considered: fully manual assessment from scratch and manual post-editing of GPT-4o predictions, followed by an unbiased analysis of disagreements and a qualitative attribution of both human and LLM errors. Results show that GPT-4o matches human judgments exactly (on a three-level scale) for 56% of from-scratch assessments, rising to 72% under post-editing; furthermore, an independent human judge correlates better with GPT-4o than with the original human judge. These findings suggest that LLM judges can be a reliable alternative for support assessment, and that the post-editing paradigm balances efficiency and fidelity, offering a scalable methodological framework for RAG evaluation.

📝 Abstract
Retrieval-augmented generation (RAG) enables large language models (LLMs) to generate answers with citations from source documents containing "ground truth", thereby reducing system hallucinations. A crucial factor in RAG evaluation is "support": whether the information in the cited documents supports the answer. To this end, we conducted a large-scale comparative study of 45 participant submissions on 36 topics to the TREC 2024 RAG Track, comparing an automatic LLM judge (GPT-4o) against human judges for support assessment. We considered two conditions: (1) fully manual assessments from scratch and (2) manual assessments with post-editing of LLM predictions. Our results indicate that for 56% of the manual from-scratch assessments, human and GPT-4o predictions match perfectly (on a three-level scale), increasing to 72% in the manual with post-editing condition. Furthermore, by carefully analyzing the disagreements in an unbiased study, we found that an independent human judge correlates better with GPT-4o than a human judge, suggesting that LLM judges can be a reliable alternative for support assessment. To conclude, we provide a qualitative analysis of human and GPT-4o errors to help guide future iterations of support assessment.
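The headline numbers (56% and 72%) are exact-match agreement: human and LLM judges each assign one of three support levels per item, and agreement is the fraction of items labeled identically. A minimal sketch of that metric, assuming illustrative label names rather than the paper's exact taxonomy:

```python
# Exact-agreement sketch: fraction of items where the human judge and the
# LLM judge (e.g., GPT-4o) assign the same support level. The three-level
# labels below are illustrative, not the paper's official scale.

SUPPORT_LEVELS = ("full", "partial", "none")

def exact_agreement(human_labels, llm_labels):
    """Return the fraction of items with perfectly matching labels."""
    if len(human_labels) != len(llm_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(h == m for h, m in zip(human_labels, llm_labels))
    return matches / len(human_labels)

# Toy example: 3 of 5 labels match exactly.
human = ["full", "none", "partial", "full", "none"]
llm   = ["full", "partial", "partial", "full", "full"]
print(exact_agreement(human, llm))  # -> 0.6
```

In the post-editing condition, the LLM prediction is the starting point that a human then corrects, so the same metric is computed between the edited labels and the original GPT-4o predictions.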
Problem

Research questions and friction points this paper is trying to address.

Evaluating support in RAG systems using human vs LLM judges
Comparing accuracy of GPT-4o and human support assessments
Analyzing disagreements to improve future support evaluation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Used GPT-4o for automatic support assessment
Compared human and LLM judges on the TREC 2024 RAG Track
Post-editing improved human-LLM agreement to 72%