🤖 AI Summary
This study addresses the growing concern that the widespread adoption of large language models (LLMs) may compromise the validity of traditional information retrieval (IR) benchmarks due to potential data contamination. Through a meta-analysis of 143 studies on the TREC Robust04 and Deep Learning 2020 (DL20) benchmarks, the authors systematically identify and formally define the “LLM effect,” propose a method for detecting data contamination in re-ranking tasks, and quantify its impact on evaluation outcomes. Results show that, since 2023, LLM-based systems achieve an 8.8% gain in nDCG@10 on DL20 over the best TREC 2020 result and an approximately 20% improvement on Robust04; however, performance declines when contaminated topics are excluded. Given the wide confidence intervals, it remains uncertain whether the observed gains reflect genuine progress or artifacts of data leakage.
📝 Abstract
Benchmark collections have long enabled controlled comparison and cumulative progress in Information Retrieval (IR). However, prior meta-analyses have shown that reported effectiveness gains often fail to accumulate, in part due to the use of weak or outdated baselines. While large language models are increasingly used in retrieval pipelines, their impact on established IR benchmarks has not been systematically analyzed. In this study, we analyze 143 publications reporting results on the TREC Robust04 collection and the TREC Deep Learning 2020 (DL20) passage retrieval benchmark to examine longitudinal trends in retrieval effectiveness and baseline strength. We observe what we term an *LLM effect*: recent systems incorporating LLM components achieve 8.8% higher nDCG@10 on DL20 compared to the best result from TREC 2020 and approximately 20% higher on Robust04 since 2023. However, adapting a data contamination detection approach to reranking reveals measurable contamination in both benchmarks. While excluding contaminated topics reduces effectiveness, confidence intervals remain wide, making it difficult to determine whether the LLM effect reflects genuine methodological advances or memorization from pretraining data.
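Both the summary and the abstract report gains in nDCG@10, the evaluation metric central to these TREC benchmarks. For readers unfamiliar with it, here is a minimal sketch of one common formulation (linear gain with a log2 position discount, as used by `trec_eval`); the paper does not specify its exact variant, and some evaluations use an exponential gain (2^rel − 1) instead:

```python
import math

def dcg_at_k(rels, k=10):
    """Discounted cumulative gain over the top-k graded relevance labels.

    `rels` is the list of relevance judgments in ranked order; position i
    (0-based) is discounted by log2(i + 2), so rank 1 has discount log2(2) = 1.
    """
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(rels, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    The ideal ranking sorts the same judgments in descending order, so a
    perfect ordering scores 1.0. Returns 0.0 when no relevant items exist.
    """
    ideal_dcg = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# A ranking already in ideal order scores 1.0; any inversion scores less.
print(ndcg_at_k([3, 2, 1, 0]))  # 1.0
print(ndcg_at_k([0, 3, 2, 1]) < 1.0)  # True
```

An 8.8% gain on DL20 thus means the per-topic nDCG@10 scores, averaged over the benchmark's topics, rose by 8.8% relative to the best TREC 2020 submission.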