Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation

📅 2025-03-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work identifies systematic biases that arise when large language models (LLMs) are used for information retrieval (IR) evaluation. LLM-based judges show a significant preference for LLM-based rankers and struggle to discern subtle performance differences between systems. Contrary to some previous findings, the authors' preliminary study finds no evidence of bias against AI-generated content. The paper synthesizes existing research and presents novel experiment designs that probe how LLM-based rankers and assistants influence LLM-based judges, providing what the authors describe as the first empirical evidence of judge bias towards LLM-based rankers. Based on these findings, the authors offer initial guidelines and a research agenda for the reliable use of LLMs in IR evaluation, arguing for a more holistic view of the LLM-driven information ecosystem.

📝 Abstract

Large language models (LLMs) are increasingly integral to information retrieval (IR), powering ranking, evaluation, and AI-assisted content creation. This widespread adoption necessitates a critical examination of potential biases arising from the interplay between these LLM-based components. This paper synthesizes existing research and presents novel experiment designs that explore how LLM-based rankers and assistants influence LLM-based judges. We provide the first empirical evidence of LLM judges exhibiting significant bias towards LLM-based rankers. Furthermore, we observe limitations in LLM judges' ability to discern subtle system performance differences. Contrary to some previous findings, our preliminary study does not find evidence of bias against AI-generated content. These results highlight the need for a more holistic view of the LLM-driven information ecosystem. To this end, we offer initial guidelines and a research agenda to ensure the reliable use of LLMs in IR evaluation.

Problem

Research questions and friction points this paper is trying to address.

Examines biases in LLM-based IR evaluation components

Assesses LLM judges' bias towards LLM rankers

Explores limitations in LLM judges' performance discernment
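
As a rough illustration of the two questions above, the sketch below shows one way such effects could be quantified. It is not the paper's method: the function names, system names, and scores are all hypothetical, and it assumes each system has one effectiveness score under human labels and one under LLM-judge labels.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall's tau-a between two score dicts keyed by system name."""
    systems = sorted(a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        sign = (a[s] - a[t]) * (b[s] - b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


def ranker_bias(human, judge, llm_systems):
    """Mean (judge - human) score gap for LLM rankers minus the gap for all other systems."""
    def mean_gap(names):
        return sum(judge[s] - human[s] for s in names) / len(names)

    others = [s for s in human if s not in llm_systems]
    return mean_gap(llm_systems) - mean_gap(others)


# Hypothetical per-system scores (for example, mean nDCG); not taken from the paper.
human = {"bm25": 0.40, "dense": 0.50, "llm_rerank_a": 0.55, "llm_rerank_b": 0.60}
judge = {"bm25": 0.42, "dense": 0.51, "llm_rerank_a": 0.65, "llm_rerank_b": 0.63}
llm_systems = ["llm_rerank_a", "llm_rerank_b"]

tau = kendall_tau(human, judge)                  # agreement on system ordering
bias = ranker_bias(human, judge, llm_systems)    # extra score inflation for LLM rankers
print(f"tau={tau:.3f} bias={bias:.3f}")
```

In this toy data the judge swaps the two closely matched LLM rerankers, which lowers the rank correlation, and inflates their scores more than those of the other systems, which yields a positive bias value.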

Innovation

Methods, ideas, or system contributions that make the work stand out.

Provides the first empirical evidence that LLM judges are biased towards LLM-based rankers

Tests LLM judges' performance discernment limitations

Proposes guidelines for reliable LLM IR evaluation

🔎 Similar Papers

Generating Diverse Criteria On-the-Fly to Improve Point-wise LLM Rankers