🤖 AI Summary
This work addresses the limited ability of retrieval systems to reason over implicit facts embedded within documents, such as temporal relations, numerical computations, and commonsense inferences, moving beyond surface-level matching or query-side complexity. To this end, we introduce ImpliRet, the first benchmark explicitly designed for “concise queries + complex document-side reasoning,” shifting the core retrieval challenge from the query to the document side. ImpliRet establishes a fine-grained evaluation framework covering temporal, numerical, and world-knowledge implicitness. We evaluate diverse retrievers, including BM25, DPR, and GPT-4.1 with extended context, using human annotations and controlled document construction. Results reveal severe limitations: the best sparse or dense retriever achieves only 15.07% nDCG@10, while GPT-4.1 scores only 35.06% when reasoning over a context of ten documents that includes the positive document. These findings confirm that document-side implicit reasoning remains a critical bottleneck in modern retrieval systems.
📝 Abstract
Retrieval systems are central to many NLP pipelines, but often rely on surface-level cues such as keyword overlap and lexical semantic similarity. To evaluate retrieval beyond these shallow signals, recent benchmarks introduce reasoning-heavy queries; however, they primarily shift the burden to query-side processing, where techniques like prompting or multi-hop retrieval can help resolve the complexity. In contrast, we present ImpliRet, a benchmark that shifts the reasoning challenge to document-side processing: the queries are simple, but relevance depends on facts stated implicitly in documents through temporal (e.g., resolving "two days ago"), arithmetic, and world knowledge relationships. We evaluate a range of sparse and dense retrievers, all of which struggle in this setting: the best nDCG@10 is only 15.07%. We also test whether long-context models can overcome this limitation, but even with a short context of only ten documents, including the positive document, GPT-4.1 scores only 35.06%, showing that document-side reasoning remains a challenge. Our code is available at github.com/ZeinabTaghavi/IMPLIRET.
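For readers unfamiliar with the headline metric, the following is a minimal sketch of how nDCG@10 can be computed. It assumes binary relevance and the standard log2 rank discount; the function names `dcg_at_k` and `ndcg_at_k` and the document ids are hypothetical, and the sketch is not taken from the ImpliRet codebase, whose exact evaluation setup may differ.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k entries of a ranked list."""
    # Rank positions are 1-based, so the item at index i is discounted by log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, relevant_ids, k=10):
    """nDCG@k with binary relevance (assumption: a document is relevant or not)."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_doc_ids]
    # The ideal ranking places all relevant documents first.
    ideal_gains = [1.0] * min(len(relevant_ids), k)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical ranking in which the single positive document lands at rank 3.
score = ndcg_at_k(["d7", "d2", "d_pos", "d9"], {"d_pos"})
```

Under these assumptions, a query with a single positive document scores 1.0 when that document is ranked first, 1/log2(4) = 0.5 when it is ranked third, and 0 when it falls outside the top ten, so a low average nDCG@10 means the positive document is usually ranked low or missing from the top ten.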