🤖 AI Summary
Existing retrieval-augmented generation methods perform well on knowledge-intensive tasks but struggle with complex queries requiring abstract reasoning, analogy, or long-range logical inference. This paper introduces DIVER, a multi-stage framework tailored for reasoning-intensive information retrieval. The approach comprises four core components: (1) document processing to improve input quality; (2) LLM-driven iterative query expansion that explicitly models reasoning intent; (3) a reasoning-enhanced retriever fine-tuned on synthetically generated multi-domain data augmented with hard negative samples; and (4) a pointwise reranking mechanism that integrates LLM-based utility scores with retrieval scores. This end-to-end, reasoning-aware pipeline significantly improves retrieval relevance. On the BRIGHT benchmark, it achieves nDCG@10 scores of 41.6 and 28.9 on original queries, setting new state-of-the-art results and demonstrating robust effectiveness in realistic, complex querying scenarios.
📝 Abstract
Retrieval-augmented generation has achieved strong performance on knowledge-intensive tasks where query-document relevance can be identified through direct lexical or semantic matches. However, many real-world queries involve abstract reasoning, analogical thinking, or multi-step inference, which existing retrievers often struggle to capture. To address this challenge, we present **DIVER**, a retrieval pipeline tailored for reasoning-intensive information retrieval. DIVER consists of four components: document processing to improve input quality, LLM-driven query expansion via iterative document interaction, a reasoning-enhanced retriever fine-tuned on synthetic multi-domain data with hard negatives, and a pointwise reranker that combines LLM-assigned helpfulness scores with retrieval scores. On the BRIGHT benchmark, DIVER achieves state-of-the-art nDCG@10 scores of 41.6 and 28.9 on original queries, consistently outperforming competitive reasoning-aware models. These results demonstrate the effectiveness of reasoning-aware retrieval strategies in complex real-world tasks. Our code and retrieval model will be released soon.
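The final reranking stage described above combines an LLM-assigned helpfulness score with the retriever's own score. A minimal sketch of such a pointwise score fusion is shown below; the linear interpolation, the `alpha` weight, and the score scales are illustrative assumptions, not the paper's actual formulation.

```python
def rerank(candidates, llm_helpfulness, alpha=0.5):
    """Pointwise reranking sketch: blend the retrieval score with an
    LLM-judged helpfulness score for each candidate document.

    candidates: list of (doc_id, retrieval_score) pairs
    llm_helpfulness: dict mapping doc_id -> helpfulness in [0, 1],
        as produced by an LLM judge (hypothetical interface)
    alpha: interpolation weight between the two signals (assumed;
        the paper's exact combination rule may differ)
    """
    scored = [
        (doc_id, alpha * llm_helpfulness.get(doc_id, 0.0) + (1 - alpha) * ret_score)
        for doc_id, ret_score in candidates
    ]
    # Higher combined score ranks first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    docs = [("d1", 0.7), ("d2", 0.9), ("d3", 0.4)]
    helpful = {"d1": 1.0, "d2": 0.2, "d3": 0.8}
    # d2 retrieves best, but the LLM judge demotes it below d1 and d3.
    print(rerank(docs, helpful))
```

The point of the sketch is that an LLM utility signal can reorder documents that lexical or embedding similarity alone ranks highly, which is exactly where reasoning-intensive queries diverge from surface matching.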