Query-Document Dense Vectors for LLM Relevance Judgment Bias Analysis

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether large language models (LLMs) exhibit systematic biases when judging relevance for information retrieval, moving beyond aggregate performance metrics. The authors propose representing relevance as a dense vector over each query-document pair, treating relevance as a relational property and embedding the pairs in a joint semantic space. Clustering within this space surfaces semantic “hotspots” where LLM judgments consistently diverge from those of human assessors. Experiments on the TREC Deep Learning 2019/2020 collections show that these biases concentrate in specific semantic clusters, particularly those involving definitional, policy-related, and ambiguous queries, where LLMs tend to under-recall relevant content or over-include irrelevant material. This approach establishes a novel paradigm for fine-grained diagnosis of relevance label distributions.

📝 Abstract
Large Language Models (LLMs) have been used as relevance assessors for Information Retrieval (IR) evaluation collection creation due to reduced cost and increased scalability as compared to human assessors. While previous research has looked at the reliability of LLMs as compared to human assessors, in this work, we aim to understand if LLMs make systematic mistakes when judging relevance, rather than just understanding how good they are on average. To this aim, we propose a novel representational method for queries and documents that allows us to analyze relevance label distributions and compare LLM and human labels to identify patterns of disagreement and localize systematic areas of disagreement. We introduce a clustering-based framework that embeds query-document (Q-D) pairs into a joint semantic space, treating relevance as a relational property. Experiments on TREC Deep Learning 2019 and 2020 show that systematic disagreement between humans and LLMs is concentrated in specific semantic clusters rather than distributed randomly. Query-level analyses reveal recurring failures, most often in definition-seeking, policy-related, or ambiguous contexts. Queries with large variation in agreement across their clusters emerge as disagreement hotspots, where LLMs tend to under-recall relevant content or over-include irrelevant material. This framework links global diagnostics with localized clustering to uncover hidden weaknesses in LLM judgments, enabling bias-aware and more reliable IR evaluation.
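The diagnostic core of the framework can be sketched as follows: given cluster assignments for query-document pairs (e.g. from k-means over joint Q-D embeddings) alongside human and LLM relevance labels, compute per-cluster agreement and flag low-agreement clusters as disagreement hotspots. This is a minimal illustration, not the paper's implementation; the function names and the 0.5 hotspot cutoff are assumptions chosen for the example.

```python
from collections import defaultdict

def cluster_agreement(cluster_ids, human_labels, llm_labels):
    """Agreement rate between human and LLM relevance labels, per cluster."""
    hits, totals = defaultdict(int), defaultdict(int)
    for c, h, l in zip(cluster_ids, human_labels, llm_labels):
        totals[c] += 1
        hits[c] += int(h == l)
    return {c: hits[c] / totals[c] for c in totals}

def disagreement_hotspots(agreement, threshold=0.5):
    """Clusters whose agreement falls below an (assumed) threshold."""
    return sorted(c for c, a in agreement.items() if a < threshold)

# Toy example: six Q-D pairs in two clusters, binary relevance labels.
clusters = [0, 0, 0, 1, 1, 1]
human    = [1, 1, 0, 1, 0, 0]
llm      = [1, 1, 0, 0, 1, 1]
agree = cluster_agreement(clusters, human, llm)
# Cluster 0 agrees on all three pairs; cluster 1 disagrees on all three,
# so cluster 1 is flagged as a hotspot.
hotspots = disagreement_hotspots(agree)
```

In the paper's setting the cluster IDs would come from clustering dense Q-D vectors, so a hotspot localizes disagreement to a semantic region (e.g. definitional or policy-related queries) rather than a single query.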
Problem

Research questions and friction points this paper is trying to address.

LLM relevance judgment
systematic bias
query-document relevance
evaluation reliability
human-LLM disagreement
Innovation

Methods, ideas, or system contributions that make the work stand out.

query-document dense vectors
relevance judgment bias
LLM systematic errors
semantic clustering
information retrieval evaluation