When LLM Judges Inflate Scores: Exploring Overrating in Relevance Assessment

📅 2026-02-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses a critical limitation of large language models (LLMs) in information retrieval relevance assessment: their systematic tendency to overrate irrelevant content, which challenges the assumption that LLMs can directly substitute for human evaluators. Through controlled experiments spanning diverse LLM backbones, both pointwise and pairwise evaluation paradigms, and passage perturbation strategies, the work systematically diagnoses scoring biases in LLM-based relevance judgments. It shows that LLMs frequently assign high scores, often with high confidence, to passages that fail to satisfy the underlying information need, and that their assessments are heavily influenced by superficial cues such as passage length and surface-level lexical features. These findings underscore significant risks in deploying LLMs for direct relevance evaluation and provide empirical evidence and cautionary guidance for the design of more reliable automated assessment methodologies.
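The summary mentions that the paper contrasts pointwise and pairwise judging. As a rough illustration of the pairwise side, the sketch below asks a judge model to choose between a genuinely relevant passage and a distractor, and flags cases where the distractor wins. The prompt wording and the `pairwise_preference` / `overrating_check` helpers are assumptions for illustration only, not the authors' released prompts or code.

```python
from typing import Callable

# Hypothetical pairwise prompt; the paper's actual wording and label scheme are not reproduced here.
PAIRWISE_PROMPT = (
    "Query: {query}\n\n"
    "Passage A: {passage_a}\n\n"
    "Passage B: {passage_b}\n\n"
    "Which passage better satisfies the information need of the query? "
    "Answer with exactly 'A' or 'B'."
)


def pairwise_preference(llm: Callable[[str], str], query: str,
                        passage_a: str, passage_b: str) -> str:
    """Return 'A' or 'B' according to the LLM judge's stated preference."""
    reply = llm(PAIRWISE_PROMPT.format(
        query=query, passage_a=passage_a, passage_b=passage_b)).strip().upper()
    if reply.startswith("A"):
        return "A"
    if reply.startswith("B"):
        return "B"
    raise ValueError(f"Unparseable preference: {reply!r}")


def overrating_check(llm: Callable[[str], str], query: str,
                     relevant: str, distractor: str) -> bool:
    """True if the judge prefers the distractor (e.g., on-topic wording but no answer)
    over the genuinely relevant passage.

    Aggregating this over many (query, relevant, distractor) triples gives a rough
    rate of the overrating behavior the paper studies.
    """
    return pairwise_preference(llm, query, relevant, distractor) == "B"
```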

📝 Abstract
Human relevance assessment is time-consuming and cognitively intensive, limiting the scalability of Information Retrieval evaluation. This has led to growing interest in using large language models (LLMs) as proxies for human judges. However, it remains an open question whether LLM-based relevance judgments are reliable, stable, and rigorous enough to match humans for relevance assessment. In this work, we conduct a systematic study of overrating behavior in LLM-based relevance judgments across model backbones, evaluation paradigms (pointwise and pairwise), and passage modification strategies. We show that models consistently assign inflated relevance scores -- often with high confidence -- to passages that do not genuinely satisfy the underlying information need, revealing a system-wide bias rather than random fluctuations in judgment. Furthermore, controlled experiments show that LLM-based relevance judgments can be highly sensitive to passage length and surface-level lexical cues. These results raise concerns about the usage of LLMs as drop-in replacements for human relevance assessors, and highlight the urgent need for careful diagnostic evaluation frameworks when applying LLMs for relevance assessments. Our code and results are publicly available.
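To make the abstract's "passage modification" and length-sensitivity claims concrete, here is a minimal pointwise probe: score a passage, pad it with uninformative filler, and score it again. A judge that rewards the padded variant exhibits the length sensitivity discussed above. The prompt text, the `pointwise_score` and `length_padding_probe` helpers, and the offline `dummy_llm` stand-in are all hypothetical; swap in a real model client to run the probe for real.

```python
from typing import Callable, Dict

# Hypothetical pointwise grading prompt; not the paper's actual prompt.
POINTWISE_PROMPT = (
    "Query: {query}\n"
    "Passage: {passage}\n"
    "On a scale of 0 (irrelevant) to 3 (perfectly relevant), how relevant is the "
    "passage to the query? Answer with a single integer."
)


def pointwise_score(llm: Callable[[str], str], query: str, passage: str) -> int:
    """Ask an LLM judge for a graded relevance label and parse the first integer in the reply."""
    reply = llm(POINTWISE_PROMPT.format(query=query, passage=passage))
    for token in reply.split():
        if token.strip(".,").isdigit():
            return int(token.strip(".,"))
    raise ValueError(f"Could not parse a score from: {reply!r}")


def length_padding_probe(
    llm: Callable[[str], str],
    query: str,
    passage: str,
    filler: str = "This sentence adds length but no information about the query.",
    n_fillers: int = 5,
) -> Dict[str, int]:
    """Compare the judge's score on the original passage vs. a padded variant.

    A higher score for the padded (but no more relevant) passage is one symptom
    of the length sensitivity reported in the paper.
    """
    padded = passage + " " + " ".join([filler] * n_fillers)
    return {
        "original": pointwise_score(llm, query, passage),
        "padded": pointwise_score(llm, query, padded),
    }


if __name__ == "__main__":
    # Stand-in "LLM" so the sketch runs offline; it crudely mimics length-induced inflation.
    def dummy_llm(prompt: str) -> str:
        return "2" if len(prompt) < 400 else "3"

    print(length_padding_probe(dummy_llm, "what causes tides",
                               "Tides are caused by the moon."))
```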
Problem

Research questions and friction points this paper is trying to address.

LLM judges
relevance assessment
overrating
information retrieval
evaluation bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based relevance judgment
overrating bias
information retrieval evaluation
systematic bias
diagnostic evaluation
🔎 Similar Papers
No similar papers found.