Judging the Judges: A Collection of LLM-Generated Relevance Judgements

📅 2025-02-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses three key challenges in information retrieval (IR) evaluation: the high cost of manual relevance annotation, the difficulty of evaluating ranking systems in low-resource settings, and the systematic biases and performance trade-offs that arise when large language models (LLMs) are used as assessors. To this end, the authors construct, benchmark, and release 42 sets of LLM-generated relevance judgments for the TREC 2023 Deep Learning track, produced by eight international research teams in the LLMJudge challenge at SIGIR 2024. The collection spans open- and closed-weight models and a variety of prompts. Its diversity lets the community diagnose systematic biases in LLM assessors, study ensembles of automatic judges, and analyze trade-offs between LLM-generated and human judgments, establishing a reproducible resource for automated IR evaluation.
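At its core, the judgment pipeline studied here asks an LLM to grade a query-passage pair on the TREC DL 0-3 relevance scale. The snippet below is a minimal illustrative sketch of one such call, not any team's actual prompt or model; the prompt wording, the model name, and the `grade_relevance` helper are assumptions for illustration.

```python
# Minimal sketch of one LLM relevance-judgment call (illustrative only;
# the LLMJudge teams' actual prompts, models, and parsing differ).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = """You are a relevance assessor for web search.
Given a query and a passage, output a single integer grade:
3 = perfectly relevant, 2 = highly relevant, 1 = related, 0 = irrelevant.

Query: {query}
Passage: {passage}
Grade:"""

def grade_relevance(query: str, passage: str, model: str = "gpt-4o-mini") -> int:
    """Ask the LLM for a 0-3 relevance grade and parse the first digit found."""
    response = client.chat.completions.create(
        model=model,  # hypothetical choice; any chat model could be swapped in
        messages=[{"role": "user", "content": PROMPT.format(query=query, passage=passage)}],
        temperature=0,
    )
    text = response.choices[0].message.content.strip()
    for ch in text:
        if ch in "0123":
            return int(ch)
    return 0  # fall back to "irrelevant" if no grade can be parsed
```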

📝 Abstract
Using Large Language Models (LLMs) for relevance assessments offers promising opportunities to improve Information Retrieval (IR), Natural Language Processing (NLP), and related fields. Indeed, LLMs hold the promise of allowing IR experimenters to build evaluation collections with a fraction of the manual human labor currently required. This could help with fresh topics on which there is still limited knowledge and could mitigate the challenges of evaluating ranking systems in low-resource scenarios, where it is challenging to find human annotators. Given the fast-paced recent developments in the domain, many questions concerning LLMs as assessors are yet to be answered. Among the aspects that require further investigation, we can list the impact of various components in a relevance judgment generation pipeline, such as the prompt used or the LLM chosen. This paper benchmarks and reports on the results of a large-scale automatic relevance judgment evaluation, the LLMJudge challenge at SIGIR 2024, where different relevance assessment approaches were proposed. In detail, we release and benchmark 42 LLM-generated labels of the TREC 2023 Deep Learning track relevance judgments produced by eight international teams who participated in the challenge. Given their diverse nature, these automatically generated relevance judgments can help the community not only investigate systematic biases caused by LLMs but also explore the effectiveness of ensemble models, analyze the trade-offs between different models and human assessors, and advance methodologies for improving automated evaluation techniques. The released resource is available at the following link: https://llm4eval.github.io/LLMJudge-benchmark/
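Because each LLM-generated label set can be lined up against the official human qrels, a basic consistency check is straightforward. The sketch below computes quadratic-weighted Cohen's kappa between one LLM label set and the human judgments on the pairs both cover; the file names and the simplified three-column label format are assumptions, not the benchmark's actual release format.

```python
# Sketch of a consistency check between LLM-generated labels and human qrels.
# Assumed (hypothetical) file format: "topic_id doc_id grade" per line.
from sklearn.metrics import cohen_kappa_score

def load_qrels(path: str) -> dict:
    """Map (topic_id, doc_id) -> integer relevance grade."""
    qrels = {}
    with open(path) as f:
        for line in f:
            topic, doc, grade = line.split()[:3]
            qrels[(topic, doc)] = int(grade)
    return qrels

human = load_qrels("human_qrels.txt")  # official human judgments (path is a placeholder)
llm = load_qrels("llm_qrels.txt")      # one of the 42 LLM-generated sets (placeholder)

shared = sorted(set(human) & set(llm))  # compare only pairs judged by both
h = [human[k] for k in shared]
l = [llm[k] for k in shared]

print("pairs compared:", len(shared))
print("quadratic-weighted Cohen's kappa:", cohen_kappa_score(h, l, weights="quadratic"))
```
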
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs as automatic relevance assessors for IR.
Assess how the choice of prompt and LLM affects generated relevance judgments.
Benchmark the automatic judgment approaches submitted to the LLMJudge challenge at SIGIR 2024.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLMs for relevance assessment across multiple prompts and models
Benchmarks 42 sets of LLM-generated relevance labels from eight teams
Explores ensemble judging and systematic biases of LLM assessors (see the sketch after this list)
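Since the 42 label sets come from different prompts and models, they can also be combined. The following is a minimal, hypothetical majority-vote ensemble over per-pair grades, not a method from the paper; the judge dictionaries and tie-breaking rule are assumptions for illustration.

```python
# Hypothetical majority-vote ensemble over several LLM judges.
# Each judge maps (topic_id, doc_id) -> grade in {0, 1, 2, 3}.
from collections import Counter

def ensemble_grade(judgments: list[dict], key: tuple) -> int:
    """Return the most common grade; break ties toward the lower grade."""
    votes = [j[key] for j in judgments if key in j]
    if not votes:
        raise KeyError(f"no judge labeled {key}")
    counts = Counter(votes)
    best = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    return best[0]

# Example: three judges disagree on one query-passage pair.
judges = [
    {("2023-1", "docA"): 3},
    {("2023-1", "docA"): 2},
    {("2023-1", "docA"): 3},
]
print(ensemble_grade(judges, ("2023-1", "docA")))  # -> 3
```
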
🔎 Similar Papers
No similar papers found.