🤖 AI Summary
Existing retrievers perform poorly on complex reasoning tasks, largely because their training data is dominated by short factual queries and lacks supervision signals tailored to deep reasoning. To address this, the authors propose ReasonIR-8B, the first general-purpose retriever explicitly trained for reasoning. Their synthetic data generation pipeline produces, for each document, a challenging reasoning-intensive query together with a plausibly related but ultimately unhelpful hard negative. Trained on a mixture of this synthetic data and existing public data, ReasonIR-8B achieves 29.9 nDCG@10 retrieval-only and 36.9 nDCG@10 with an LLM reranker on the BRIGHT benchmark, and improves RAG accuracy on MMLU and GPQA by 6.4% and 22.6%, respectively, over the closed-book baseline. It also uses test-time compute effectively, benefiting from longer, information-rich rewritten queries. All components, including the model, training code, and synthetic datasets, are fully open-sourced.
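The summary describes training a retriever on triplets of a hard query, its source document, and a hard negative. A minimal sketch of how such a triplet might feed a standard bi-encoder contrastive (InfoNCE-style) objective is below; the triplet fields and the `info_nce_loss` helper are illustrative assumptions, not the paper's exact recipe, and real training would operate on learned embeddings in batches.

```python
import numpy as np

# Hypothetical triplet from a reasoning-oriented synthesis step: for each
# source document, an LLM writes a challenging query plus a superficially
# similar but unhelpful hard negative (field names are illustrative).
triplet = {
    "query": "Why does ... ?",
    "positive": "...source document text...",
    "hard_negative": "...plausibly related but unhelpful passage...",
}

def info_nce_loss(q, pos, negs, temperature=0.02):
    """Contrastive loss over embedding vectors: pull the query toward its
    positive document and push it away from the negatives. This is a
    generic bi-encoder objective, sketched here as an assumption."""
    q = q / np.linalg.norm(q)
    docs = np.stack([pos] + list(negs))
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = docs @ q / temperature   # scaled cosine similarities
    logits -= logits.max()            # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])          # positive document sits at index 0
```

Hard negatives matter here because the loss is only informative when the negative's similarity to the query is close to the positive's; trivially unrelated negatives contribute almost nothing to the gradient.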
📝 Abstract
We present ReasonIR-8B, the first retriever specifically trained for general reasoning tasks. Existing retrievers have shown limited gains on reasoning tasks, in part because existing training datasets focus on short factual queries tied to documents that straightforwardly answer them. We develop a synthetic data generation pipeline that, for each document, creates a challenging and relevant query, along with a plausibly related but ultimately unhelpful hard negative. By training on a mixture of our synthetic data and existing public data, ReasonIR-8B achieves a new state-of-the-art of 29.9 nDCG@10 without a reranker and 36.9 nDCG@10 with a reranker on BRIGHT, a widely used reasoning-intensive information retrieval (IR) benchmark. When applied to RAG tasks, ReasonIR-8B improves MMLU and GPQA performance by 6.4% and 22.6% respectively, relative to the closed-book baseline, outperforming other retrievers and search engines. In addition, ReasonIR-8B uses test-time compute more effectively: on BRIGHT, its performance consistently increases with longer and more information-rich rewritten queries, and it continues to outperform other retrievers when combined with an LLM reranker. Our training recipe is general and can be easily extended to future LLMs; to this end, we open-source our code, data, and model.