🤖 AI Summary
Traditional information retrieval primarily emphasizes surface-level similarity between documents and queries, often overlooking whether a document is actually useful for supporting a decision. This work proposes a retrieval paradigm centered on decision usefulness and introduces UsefulBench, the first benchmark dataset annotated by domain experts for both relevance and usefulness. Systematic comparisons among classical retrieval models, large language models (LLMs), and human expert judgments show that conventional methods strongly favor relevance, while LLMs, despite modest gains on usefulness, still fall short of expert-level assessments. The resulting evaluation framework and dataset provide a resource for developing and assessing usefulness-oriented retrieval systems.
📝 Abstract
Conventional information retrieval is concerned with identifying texts that are relevant to a given query. Yet the conventional definition of relevance is dominated by textual similarity, leaving open whether a text is actually useful for addressing the query. For instance, when answering whether Paris is larger than Berlin, texts about Paris being in France are relevant (lexically or semantically similar) but not useful. In this paper, we introduce UsefulBench, a domain-specific dataset curated by three professional analysts who label whether a text is connected to a query (relevance) and whether it holds practical value in responding to it (usefulness). We show that classic similarity-based information retrieval aligns more strongly with relevance than with usefulness. While LLM-based systems can counteract this bias, we find that domain-specific problems require a high degree of expertise, which current LLMs do not fully incorporate. We explore approaches to (partially) overcome this challenge. Nevertheless, UsefulBench remains a challenging dataset for targeted information retrieval systems.
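As a toy illustration of the Paris/Berlin example in the abstract, the sketch below scores two candidate texts against the query with plain token-overlap (Jaccard) similarity. This is only an assumed stand-in for similarity-based retrieval, not the method evaluated in the paper, and the candidate texts and the labels `relevant_only` and `useful` are made up for illustration rather than taken from UsefulBench.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity: |A & B| / |A | B| (0.0 if both are empty)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


query = "Is Paris larger than Berlin?"

# Hypothetical candidate texts, written for this sketch only.
candidates = {
    # Shares words with the query but does not help answer it.
    "relevant_only": "Paris is the capital of France.",
    # Contains the figures needed to compare the two cities by area.
    "useful": "Berlin covers about 891 square kilometres, "
              "whereas Paris covers about 105.",
}

scores = {label: jaccard(query, text) for label, text in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for label in ranking:
    print(f"{label}: {scores[label]:.3f}")
```

In this sketch the text that merely mentions Paris ranks above the text that contains the information needed to answer the question, which is the kind of gap between relevance and usefulness that the abstract describes.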