Benchmarking LLM-based Relevance Judgment Methods

πŸ“… 2025-04-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work systematically compares five LLM-based relevance judgment methods: binary relevance judgments, graded relevance assessments, pairwise preference-based judgments, and two nugget-based evaluation methods (document-agnostic and document-dependent). The methods are evaluated under a unified framework on the TREC Deep Learning 2019-2021 tracks and the ANTIQUE dataset, using diverse prompting strategies with both an open-source model (Llama3.2b) and a commercial model (GPT-4o). Beyond the traditional comparison of system rankings via Kendall correlations, the authors examine how well LLM judgments align with human preferences inferred from relevance grades. Results indicate that pairwise preference and nugget-based approaches align more closely with human judgments than graded assessment. All code, data, and generated judgments are publicly released, providing a directly comparable benchmark and empirically grounded guidance for LLM-based relevance evaluation.
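For illustration, the binary, graded, and pairwise paradigms compared in the paper can be realized with prompt templates along the following lines. These templates are hypothetical sketches, not the paper's actual prompts; the released repository contains the prompts used in the experiments.

```python
# Hypothetical prompt templates for three of the compared judgment paradigms.
# Wording, grade scale, and answer format are illustrative assumptions.

BINARY_PROMPT = (
    "Query: {query}\nDocument: {document}\n"
    "Is the document relevant to the query? Answer 'yes' or 'no'."
)

GRADED_PROMPT = (
    "Query: {query}\nDocument: {document}\n"
    "Rate the relevance of the document to the query on a scale from 0 "
    "(not relevant) to 3 (perfectly relevant). Answer with a single digit."
)

PAIRWISE_PROMPT = (
    "Query: {query}\nDocument A: {doc_a}\nDocument B: {doc_b}\n"
    "Which document better answers the query? Answer 'A' or 'B'."
)
```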

πŸ“ Abstract
Large Language Models (LLMs) are increasingly deployed in both academic and industry settings to automate the evaluation of information seeking systems, particularly by generating graded relevance judgments. Previous work on LLM-based relevance assessment has primarily focused on replicating graded human relevance judgments through various prompting strategies. However, there has been limited exploration of alternative assessment methods or comprehensive comparative studies. In this paper, we systematically compare multiple LLM-based relevance assessment methods, including binary relevance judgments, graded relevance assessments, pairwise preference-based methods, and two nugget-based evaluation methods: document-agnostic and document-dependent. In addition to a traditional comparison based on system rankings using Kendall correlations, we also examine how well LLM judgments align with human preferences, as inferred from relevance grades. We conduct extensive experiments on datasets from the three TREC Deep Learning tracks (2019, 2020, and 2021) as well as the ANTIQUE dataset, which focuses on non-factoid open-domain question answering. As part of our data release, we include relevance judgments generated by both an open-source (Llama3.2b) and a commercial (gpt-4o) model. Our goal is to reproduce various LLM-based relevance judgment methods to provide a comprehensive comparison. All code, data, and resources are publicly available in our GitHub repository at https://github.com/Narabzad/llm-relevance-judgement-comparison.
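As a rough sketch of the two comparison axes in the abstract, the snippet below computes (a) the Kendall correlation between system rankings produced under human versus LLM qrels, and (b) the fraction of document pairs whose human-implied preference (higher grade preferred) is reproduced by the LLM judgments. Function names, data layouts, and the exact agreement definition are assumptions for illustration, not the paper's precise methodology.

```python
# Sketch only: assumes per-system scores and per-(query, doc) grades are
# already computed; definitions here are illustrative, not the paper's.
from itertools import combinations
from scipy.stats import kendalltau

def ranking_correlation(system_scores_human, system_scores_llm):
    """Kendall's tau between system rankings under human vs. LLM qrels.

    Both inputs map system name -> evaluation score (e.g., NDCG@10).
    """
    systems = sorted(system_scores_human)
    human = [system_scores_human[s] for s in systems]
    llm = [system_scores_llm[s] for s in systems]
    tau, _ = kendalltau(human, llm)
    return tau

def preference_agreement(human_grades, llm_grades):
    """Fraction of document pairs whose human-implied preference order is
    reproduced by the LLM judgments.

    Both inputs map (query_id, doc_id) -> relevance grade. Only pairs with
    strictly different human grades for the same query are counted; LLM ties
    on such pairs count as disagreement.
    """
    agree = total = 0
    queries = {q for q, _ in human_grades}
    for q in queries:
        docs = [d for (qq, d) in human_grades if qq == q]
        for d1, d2 in combinations(docs, 2):
            h1, h2 = human_grades[(q, d1)], human_grades[(q, d2)]
            if h1 == h2:
                continue  # no human preference expressed for this pair
            l1, l2 = llm_grades.get((q, d1)), llm_grades.get((q, d2))
            if l1 is None or l2 is None:
                continue  # pair not judged by the LLM
            total += 1
            if l1 != l2 and (h1 > h2) == (l1 > l2):
                agree += 1
    return agree / total if total else 0.0
```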
Problem

Research questions and friction points this paper is trying to address.

Compare LLM-based relevance judgment methods comprehensively
Evaluate alignment of LLM judgments with human preferences
Reproduce various LLM-based relevance assessment techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compare multiple LLM-based relevance assessment methods
Include binary, graded, pairwise preference, and nugget-based judgments
Evaluate alignment with human preferences
πŸ”Ž Similar Papers
No similar papers found.
Negar Arabzadeh
UC Berkeley
Information Retrieval · Natural Language Processing · Evaluation
Charles L.A. Clarke
University of Waterloo, Waterloo, Ontario, Canada