🤖 AI Summary
This work exposes critical limitations of large language models (LLMs) as automatic relevance judges in offline information retrieval evaluation: LLM-generated judgments fail to fairly rank top-performing retrieval systems and inflate false positive rates in statistical significance tests (e.g., t-tests, ANOVA) to over 60%, diverging substantially from human annotations. Methodologically, the study systematically evaluates how LLM-based assessment—using state-of-the-art models including GPT-4 and Claude—affects relative ranking fidelity and pairwise significance preservation among leading retrieval systems, quantifying deviations from human ground truth via Kendall’s τ and hypothesis testing. The core contribution is the introduction of “statistical significance fidelity” as a novel evaluation dimension, providing crucial empirical evidence, methodological insights, and a cautionary benchmark for developing reliable automated IR evaluation paradigms.
📝 Abstract
Offline evaluation of search systems depends on test collections. These benchmarks provide the researchers with a corpus of documents, topics and relevance judgements indicating which documents are relevant for each topic. While test collections are an integral part of Information Retrieval (IR) research, their creation involves significant efforts in manual annotation. Large language models (LLMs) are gaining much attention as tools for automatic relevance assessment. Recent research has shown that LLM-based assessments yield high systems ranking correlation with human-made judgements. These correlations are helpful in large-scale experiments but less informative if we want to focus on top-performing systems. Moreover, these correlations ignore whether and how LLM-based judgements impact the statistically significant differences among systems with respect to human assessments. In this work, we look at how LLM-generated judgements preserve ranking differences among top-performing systems and also how they preserve pairwise significance evaluation as human judgements. Our results show that LLM-based judgements are unfair at ranking top-performing systems. Moreover, we observe an exceedingly high rate of false positives regarding statistical differences. Our work represents a step forward in the evaluation of the reliability of using LLMs-based judgements for IR evaluation. We hope this will serve as a basis for other researchers to develop more reliable models for automatic relevance assessment.