ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited ability of retrieval systems to reason over implicit facts, such as temporal relations, numerical computations, and commonsense inferences, embedded within documents, moving beyond surface-level matching or query-side complexity. To this end, we introduce ImpliRet, the first benchmark explicitly designed around concise queries paired with complex document-side reasoning, shifting the core retrieval challenge from the query to the document. ImpliRet establishes a fine-grained evaluation framework covering temporal, numerical, and world-knowledge implicitness. We evaluate diverse sparse and dense retrievers (including BM25 and DPR) as well as long-context models such as GPT-4.1, using human annotations and controlled document construction. Results reveal severe limitations: the best sparse or dense retriever achieves only 15.07% nDCG@10, and GPT-4.1 attains merely 35.06% accuracy when reasoning over ten retrieved documents that include the relevant one. These findings confirm that document-side implicit reasoning remains a critical bottleneck for modern retrieval systems.
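For reference, nDCG@10, the headline retrieval metric above, rewards rankings that place relevant documents near the top, with a logarithmic discount by rank. A minimal sketch of the standard computation, assuming binary relevance labels (this is the generic metric, not the paper's evaluation code):

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: each relevance score is discounted
    # by log2(rank + 1), so hits near the top count more.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of an ideal ranking (relevances sorted descending).
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical example: the single relevant document is ranked 4th of 10.
ranking = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(round(ndcg_at_k(ranking, 10), 4))  # → 0.4307
```

With one relevant document, nDCG@10 is simply 1/log2(rank + 1), so a score as low as 15.07% on average implies relevant documents typically rank far down the list, if they are retrieved at all.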

📝 Abstract
Retrieval systems are central to many NLP pipelines, but often rely on surface-level cues such as keyword overlap and lexical semantic similarity. To evaluate retrieval beyond these shallow signals, recent benchmarks introduce reasoning-heavy queries; however, they primarily shift the burden to query-side processing techniques -- like prompting or multi-hop retrieval -- that can help resolve complexity. In contrast, we present ImpliRet, a benchmark that shifts the reasoning challenge to document-side processing: the queries are simple, but relevance depends on facts stated implicitly in documents through temporal (e.g., resolving "two days ago"), arithmetic, and world knowledge relationships. We evaluate a range of sparse and dense retrievers, all of which struggle in this setting: the best nDCG@10 is only 15.07%. We also test whether long-context models can overcome this limitation. But even with a short context of only ten documents, including the positive document, GPT-4.1 scores only 35.06%, showing that document-side reasoning remains a challenge. Our code is available at github.com/ZeinabTaghavi/IMPLIRET.
Problem

Research questions and friction points this paper is trying to address.

Evaluates retrieval systems using implicit document facts
Tests sparse and dense retrievers on reasoning challenges
Assesses long-context models on document-side reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark focuses on implicit document-side facts
Evaluates sparse and dense retrieval methods
Tests long-context models like GPT-4.1
Zeinab Taghavi
Center for Information and Language Processing, LMU Munich; Munich Center for Machine Learning (MCML)
Ali Modarressi
PhD student at LMU Munich
Natural Language Processing · Deep Learning · Artificial Intelligence
Yunpu Ma
Ludwig Maximilian University of Munich
Foundation Models · Agentic AI · Temporal Knowledge Graph · Quantum AI
Hinrich Schütze
Center for Information and Language Processing, LMU Munich; Munich Center for Machine Learning (MCML)