🤖 AI Summary
This work addresses the poor performance of retrieval modules in existing deep research agents on reasoning-intensive scientific queries. To this end, the authors introduce SAGE, a benchmark comprising 1,200 interdisciplinary queries and a corpus of 200,000 scientific papers, and present the first systematic evaluation of large language model (LLM)-based retrievers in this setting. Their analysis reveals that LLM-based retrievers significantly underperform BM25 (by approximately 30%) on keyword-style subqueries. To improve retrieval effectiveness, they propose a corpus-level test-time scaling framework that leverages LLMs to enrich documents with metadata and keywords. Experiments on SAGE demonstrate that this approach boosts retrieval performance by 8% on short-answer questions and by 2% on open-ended questions, confirming the efficacy of metadata augmentation for complex scientific retrieval tasks.
📄 Abstract
Deep research agents have emerged as powerful systems for addressing complex queries. Meanwhile, LLM-based retrievers have demonstrated strong instruction-following and reasoning capabilities. This raises a critical question: can LLM-based retrievers effectively contribute to deep research agent workflows? To investigate this, we introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains, with a 200,000-paper retrieval corpus. We evaluate six deep research agents and find that all systems struggle with reasoning-intensive retrieval. Using DR Tulu as the backbone, we further compare BM25 and LLM-based retrievers (i.e., ReasonIR and gte-Qwen2-7B-instruct) as alternative search tools. Surprisingly, BM25 significantly outperforms LLM-based retrievers, by approximately 30%, because existing agents generate keyword-oriented sub-queries. To improve performance, we propose a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, making retrieval easier for off-the-shelf retrievers. This yields 8% and 2% gains on short-form and open-ended questions, respectively.
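The corpus-level augmentation idea can be sketched as follows: before indexing, each document is enriched with LLM-generated metadata and keywords, so a keyword-oriented sub-query from the agent has more surface forms to match. This is a minimal, self-contained sketch, not the paper's implementation: `mock_llm_enrich` is a hypothetical stand-in for the actual LLM call (the prompt and model are not specified here), and the tiny BM25 scorer is included only to make the example runnable end to end.

```python
import math
from collections import Counter

def mock_llm_enrich(doc: str) -> str:
    # Hypothetical stand-in for an LLM call that extracts metadata
    # and keywords from a document; here we just pull longer tokens.
    keywords = {w for w in doc.lower().split() if len(w) > 6}
    return " ".join(sorted(keywords))

def augment_corpus(corpus):
    # Corpus-level test-time scaling: append generated metadata and
    # keywords to each document before building the retrieval index.
    return [doc + " || " + mock_llm_enrich(doc) for doc in corpus]

class BM25:
    """Minimal Okapi BM25 over whitespace tokens (illustration only)."""
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [d.lower().split() for d in docs]
        self.N = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.N
        self.tfs = [Counter(d) for d in self.docs]
        self.df = Counter()
        for tf in self.tfs:
            self.df.update(tf.keys())

    def score(self, query, i):
        # Sum IDF-weighted, length-normalized term frequencies.
        s = 0.0
        for term in query.lower().split():
            tf = self.tfs[i].get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (self.N - self.df[term] + 0.5) / (self.df[term] + 0.5))
            dl = len(self.docs[i])
            norm = tf + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            s += idf * tf * (self.k1 + 1) / norm
        return s

    def rank(self, query):
        return sorted(range(self.N), key=lambda i: self.score(query, i), reverse=True)

corpus = [
    "Transformer architectures for protein structure prediction",
    "Graph neural networks applied to molecular property tasks",
]
bm25 = BM25(augment_corpus(corpus))
print(bm25.rank("protein prediction"))  # document 0 ranks first
```

In the paper's setting the enrichment would be produced by an actual LLM and the index would cover the full 200,000-paper corpus; the point of the sketch is that augmentation happens once at the corpus level, leaving the off-the-shelf retriever unchanged.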