🤖 AI Summary
This work addresses the poor performance of retrieval modules in existing deep research agents on reasoning-intensive scientific queries. To this end, the authors introduce SAGE, a benchmark comprising 1,200 interdisciplinary queries and a corpus of 200,000 scientific papers, and present the first systematic evaluation of large language model (LLM)-based retrievers in this setting. Their analysis reveals that LLM-based retrievers significantly underperform BM25 (by approximately 30%) on keyword-style subqueries. To improve retrieval effectiveness, they propose a corpus-level test-time scaling framework that leverages LLMs to enrich documents with metadata and keywords. Experiments on SAGE demonstrate that this approach boosts retrieval performance by 8% on short-answer questions and by 2% on open-ended questions, confirming the efficacy of metadata augmentation for complex scientific retrieval tasks.
📄 Abstract
Deep research agents have emerged as powerful systems for addressing complex queries. Meanwhile, LLM-based retrievers have demonstrated strong instruction-following and reasoning capabilities. This raises a critical question: can LLM-based retrievers effectively contribute to deep research agent workflows? To investigate this, we introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains, with a 200,000-paper retrieval corpus. We evaluate six deep research agents and find that all systems struggle with reasoning-intensive retrieval. Using DR Tulu as the backbone, we further compare BM25 and LLM-based retrievers (i.e., ReasonIR and gte-Qwen2-7B-instruct) as alternative search tools. Surprisingly, BM25 significantly outperforms LLM-based retrievers, by approximately 30%, because existing agents generate keyword-oriented sub-queries. To improve performance, we propose a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, making retrieval easier for off-the-shelf retrievers. This yields 8% and 2% gains on short-form and open-ended questions, respectively.
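The corpus-level augmentation idea can be sketched as follows: before indexing, each document is enriched with LLM-generated metadata and keywords, so a keyword-oriented sub-query from the agent has more surface forms to match. This is a minimal, self-contained sketch, not the paper's implementation: `mock_llm_enrich` is a hypothetical stand-in for the actual LLM call (the prompt and model are not specified here), and the tiny BM25 scorer is included only to make the example runnable end to end.

```python
import math
from collections import Counter

def mock_llm_enrich(doc: str) -> str:
    # Hypothetical stand-in for an LLM call that extracts metadata
    # and keywords from a document; here we just pull longer tokens.
    keywords = {w for w in doc.lower().split() if len(w) > 6}
    return " ".join(sorted(keywords))

def augment_corpus(corpus):
    # Corpus-level test-time scaling: append generated metadata and
    # keywords to each document before building the retrieval index.
    return [doc + " || " + mock_llm_enrich(doc) for doc in corpus]

class BM25:
    """Minimal Okapi BM25 over whitespace tokens (illustration only)."""
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [d.lower().split() for d in docs]
        self.N = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.N
        self.tfs = [Counter(d) for d in self.docs]
        self.df = Counter()
        for tf in self.tfs:
            self.df.update(tf.keys())

    def score(self, query, i):
        # Sum IDF-weighted, length-normalized term frequencies.
        s = 0.0
        for term in query.lower().split():
            tf = self.tfs[i].get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (self.N - self.df[term] + 0.5) / (self.df[term] + 0.5))
            dl = len(self.docs[i])
            norm = tf + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            s += idf * tf * (self.k1 + 1) / norm
        return s

    def rank(self, query):
        return sorted(range(self.N), key=lambda i: self.score(query, i), reverse=True)

corpus = [
    "Transformer architectures for protein structure prediction",
    "Graph neural networks applied to molecular property tasks",
]
bm25 = BM25(augment_corpus(corpus))
print(bm25.rank("protein prediction"))  # document 0 ranks first
```

In the paper's setting the enrichment would be produced by an actual LLM and the index would cover the full 200,000-paper corpus; the point of the sketch is that augmentation happens once at the corpus level, leaving the off-the-shelf retriever unchanged.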