When LLM Judges Inflate Scores: Exploring Overrating in Relevance Assessment

📅 2026-02-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses a critical limitation of large language models (LLMs) in information retrieval relevance assessment: their systematic tendency to overrate irrelevant content, which challenges the assumption that LLMs can directly substitute for human evaluators. Through controlled experiments spanning diverse LLM backbones, both pointwise and pairwise evaluation paradigms, and passage perturbation strategies, the work systematically diagnoses scoring biases in LLM-based relevance judgments. It shows that LLMs frequently assign high scores, often with high confidence, to passages that fail to satisfy the underlying information need, and that their assessments are heavily influenced by superficial cues such as passage length and surface-level lexical features. These findings underscore significant risks in deploying LLMs for direct relevance evaluation and provide empirical evidence and cautionary guidance for the design of more reliable automated assessment methodologies.
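The summary mentions that the paper contrasts pointwise and pairwise judging. As a rough illustration of the pairwise side, the sketch below asks a judge model to choose between a genuinely relevant passage and a distractor, and flags cases where the distractor wins. The prompt wording and the `pairwise_preference` / `overrating_check` helpers are assumptions for illustration only, not the authors' released prompts or code.

```python
from typing import Callable

# Hypothetical pairwise prompt; the paper's actual wording and label scheme are not reproduced here.
PAIRWISE_PROMPT = (
    "Query: {query}\n\n"
    "Passage A: {passage_a}\n\n"
    "Passage B: {passage_b}\n\n"
    "Which passage better satisfies the information need of the query? "
    "Answer with exactly 'A' or 'B'."
)


def pairwise_preference(llm: Callable[[str], str], query: str,
                        passage_a: str, passage_b: str) -> str:
    """Return 'A' or 'B' according to the LLM judge's stated preference."""
    reply = llm(PAIRWISE_PROMPT.format(
        query=query, passage_a=passage_a, passage_b=passage_b)).strip().upper()
    if reply.startswith("A"):
        return "A"
    if reply.startswith("B"):
        return "B"
    raise ValueError(f"Unparseable preference: {reply!r}")


def overrating_check(llm: Callable[[str], str], query: str,
                     relevant: str, distractor: str) -> bool:
    """True if the judge prefers the distractor (e.g., on-topic wording but no answer)
    over the genuinely relevant passage.

    Aggregating this over many (query, relevant, distractor) triples gives a rough
    rate of the overrating behavior the paper studies.
    """
    return pairwise_preference(llm, query, relevant, distractor) == "B"
```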

📝 Abstract
Human relevance assessment is time-consuming and cognitively intensive, limiting the scalability of Information Retrieval evaluation. This has led to growing interest in using large language models (LLMs) as proxies for human judges. However, it remains an open question whether LLM-based relevance judgments are reliable, stable, and rigorous enough to match humans for relevance assessment. In this work, we conduct a systematic study of overrating behavior in LLM-based relevance judgments across model backbones, evaluation paradigms (pointwise and pairwise), and passage modification strategies. We show that models consistently assign inflated relevance scores -- often with high confidence -- to passages that do not genuinely satisfy the underlying information need, revealing a system-wide bias rather than random fluctuations in judgment. Furthermore, controlled experiments show that LLM-based relevance judgments can be highly sensitive to passage length and surface-level lexical cues. These results raise concerns about the usage of LLMs as drop-in replacements for human relevance assessors, and highlight the urgent need for careful diagnostic evaluation frameworks when applying LLMs for relevance assessments. Our code and results are publicly available.
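To make the abstract's "passage modification" and length-sensitivity claims concrete, here is a minimal pointwise probe: score a passage, pad it with uninformative filler, and score it again. A judge that rewards the padded variant exhibits the length sensitivity discussed above. The prompt text, the `pointwise_score` and `length_padding_probe` helpers, and the offline `dummy_llm` stand-in are all hypothetical; swap in a real model client to run the probe for real.

```python
from typing import Callable, Dict

# Hypothetical pointwise grading prompt; not the paper's actual prompt.
POINTWISE_PROMPT = (
    "Query: {query}\n"
    "Passage: {passage}\n"
    "On a scale of 0 (irrelevant) to 3 (perfectly relevant), how relevant is the "
    "passage to the query? Answer with a single integer."
)


def pointwise_score(llm: Callable[[str], str], query: str, passage: str) -> int:
    """Ask an LLM judge for a graded relevance label and parse the first integer in the reply."""
    reply = llm(POINTWISE_PROMPT.format(query=query, passage=passage))
    for token in reply.split():
        if token.strip(".,").isdigit():
            return int(token.strip(".,"))
    raise ValueError(f"Could not parse a score from: {reply!r}")


def length_padding_probe(
    llm: Callable[[str], str],
    query: str,
    passage: str,
    filler: str = "This sentence adds length but no information about the query.",
    n_fillers: int = 5,
) -> Dict[str, int]:
    """Compare the judge's score on the original passage vs. a padded variant.

    A higher score for the padded (but no more relevant) passage is one symptom
    of the length sensitivity reported in the paper.
    """
    padded = passage + " " + " ".join([filler] * n_fillers)
    return {
        "original": pointwise_score(llm, query, passage),
        "padded": pointwise_score(llm, query, padded),
    }


if __name__ == "__main__":
    # Stand-in "LLM" so the sketch runs offline; it crudely mimics length-induced inflation.
    def dummy_llm(prompt: str) -> str:
        return "2" if len(prompt) < 400 else "3"

    print(length_padding_probe(dummy_llm, "what causes tides",
                               "Tides are caused by the moon."))
```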
Problem

Research questions and friction points this paper is trying to address.

LLM judges
relevance assessment
overrating
information retrieval
evaluation bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based relevance judgment
overrating bias
information retrieval evaluation
systematic bias
diagnostic evaluation
🔎 Similar Papers
No similar papers found.