🤖 AI Summary
Traditional embedding-based information retrieval methods are limited by shallow semantic matching and struggle to capture deep relevance between queries and documents. This work proposes LLM-RJS, a large language model–based relevance judgment system endowed with explicit reasoning capabilities to assess relevance through semantic inference. Evaluated on TREC-DL 2019 against neural embedding approaches, LLM-RJS reveals a “short-sighted bias” in existing relevance annotations: many of its false positives stem not from model errors but from deficiencies in the ground-truth labels, leading to an underestimation of its true performance. The study highlights critical limitations in current evaluation paradigms and calls for the development of assessment frameworks better aligned with the reasoning capacities of advanced language models.
📝 Abstract
With the emergence of Large Language Models (LLMs), new Information Retrieval methods are available in which relevance is estimated directly through language understanding and reasoning rather than through embedding similarity. We argue that similarity is a short-sighted interpretation of relevance, and that LLM-Based Relevance Judgment Systems (LLM-RJS) with reasoning have the potential to outperform Neural Embedding Retrieval Systems (NERS) by overcoming this limitation. Using the TREC-DL 2019 passage retrieval dataset, we compare several LLM-RJS with NERS but observe no noticeable improvement. We then analyze the impact of reasoning by comparing LLM-RJS with and without it. We find that human annotations also suffer from short-sightedness, and that false positives produced by the reasoning LLM-RJS are primarily annotation mistakes caused by this short-sightedness. We conclude that LLM-RJS can address the short-sightedness limitation of NERS, but that this ability cannot be evaluated with standard annotated relevance datasets.
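The contrast between the two paradigms can be sketched in a toy example. Below, a bag-of-words cosine score stands in for a NERS-style similarity scorer, and a hypothetical prompt illustrates how an LLM-RJS would instead be asked to *reason* about relevance; the prompt wording and the example query/passage are illustrative assumptions, not the paper's actual setup.

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity, standing in for a
    NERS-style embedding scorer (an illustrative simplification)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def llm_rjs_prompt(query: str, passage: str) -> str:
    """Hypothetical LLM-RJS prompt: the model is asked to reason about
    relevance instead of relying on surface similarity."""
    return (
        "Judge whether the passage answers the query. "
        "Reason step by step, then output 'relevant' or 'not relevant'.\n"
        f"Query: {query}\nPassage: {passage}"
    )

query = "how long to cook a turkey"
passage = "Roast the bird at 180C for about 20 minutes per kilogram."
# Lexical overlap is zero even though the passage answers the query --
# the kind of "short-sightedness" the paper attributes to
# similarity-based relevance estimates.
print(cosine_similarity(query, passage))
```

A reasoning LLM, given the prompt above, could infer that "the bird" refers to the turkey and that cooking time follows from weight, judging the passage relevant where a pure similarity score does not.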