Can Deep Research Agents Retrieve and Organize? Evaluating the Synthesis Gap with Expert Taxonomies

📅 2026-01-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of effective evaluation for deep research agents’ ability to retrieve key literature and construct expert-level knowledge structures in automated survey generation. To this end, we introduce TaxoBench, a benchmark built upon 3,815 expert-annotated core references and their corresponding taxonomy trees extracted from 72 highly cited computer science surveys. For the first time, expert-curated taxonomies serve as the gold standard to systematically evaluate the end-to-end retrieval and structuring performance of seven deep research agents and twelve state-of-the-art large language models under both Deep Research and Bottom-Up paradigms. Experimental results reveal that even the best-performing system recalls only 20.9% of expert-selected papers, with a maximum Adjusted Rand Index (ARI) of 0.31 in structural organization—far below expert quality—highlighting a significant gap in both evaluation frameworks and current system capabilities.

📝 Abstract
Deep Research Agents increasingly automate survey generation, yet whether they match human experts in two core abilities remains unclear: retrieving essential papers and organizing them into expert-like taxonomies. Existing benchmarks emphasize writing quality or citation correctness, while standard clustering metrics fail to capture hierarchical taxonomy structure. We introduce TaxoBench, a benchmark built from 72 highly-cited LLM surveys containing expert-authored taxonomy trees with 3,815 papers mapped to paper categories as ground truth. TaxoBench evaluates both abilities: (1) retrieval, measuring whether agents retrieve expert-cited papers; and (2) organization, assessed at two levels: the leaf-level measures paper-to-category assignment, while the hierarchy-level measures taxonomy structure via novel metrics -- Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path). TaxoBench supports two evaluation modes: Deep Research tests end-to-end capability given only a topic, while Bottom-Up provides the expert paper set to isolate organization ability. Evaluating 7 Deep Research Agents and 12 frontier LLMs reveals a dual bottleneck: the best agent retrieves only 20.92% of expert-cited papers, and even with perfect input, the best model achieves only 31.24% ARI with substantial structural gaps. Our benchmark is publicly available at https://github.com/KongLongGeFDU/TaxoBench.
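The leaf-level organization score reported above is the Adjusted Rand Index (ARI), a standard measure of agreement between two clusterings (here, expert vs. model paper-to-category assignments) corrected for chance. As a rough illustration only (not the benchmark's own code), a self-contained ARI implementation using the pair-counting formula:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two labelings of the same items (e.g. papers).

    1.0 means identical groupings (up to label renaming); values near 0
    mean chance-level agreement; negative values mean worse than chance.
    """
    n = len(labels_true)
    # Contingency counts: how many items share each (true, predicted) pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_comb = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in true_counts.values())
    sum_b = sum(comb(c, 2) for c in pred_counts.values())
    total_pairs = comb(n, 2)

    expected = sum_a * sum_b / total_pairs   # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate: one cluster, or all singletons
        return 1.0
    return (sum_comb - expected) / (max_index - expected)
```

Because ARI is invariant to label renaming, a model that reproduces the expert's grouping under different category names still scores 1.0; the paper's hierarchy-level metrics (US-TED/US-NTED, Sem-Path) are needed precisely because ARI ignores the tree structure above the leaves.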
Problem

Research questions and friction points this paper is trying to address.

Deep Research Agents
Survey Generation
Taxonomy Construction
Information Synthesis
Knowledge Organization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep Research Agents
Taxonomy-based Evaluation
Survey Generation
Knowledge Organization
Retrieval and Structuring
Authors

Ming Zhang
School of Computer Science and Technology, Fudan University
LLM
Jiabao Zhuang
Fudan NLP Group
Wenqing Jing
Fudan NLP Group
Ziyu Kong
Fudan NLP Group
Jingyi Deng
Fudan NLP Group
Yujiong Shen
Fudan NLP Group
Kexin Tan
Fudan NLP Group
Yuhang Zhao
Fudan NLP Group
Ning Luo
Fudan NLP Group
Renzhe Zheng
Fudan NLP Group
Jiahui Lin
Fudan NLP Group
Mingqi Wu
Director of Data Science, Microsoft
AI, Machine Learning, Statistics, Data Science
Long Ma
Dalian University of Technology
Computer Vision, Image Processing
Yi Zou
Intel Labs
Near-data and in-memory computing, Computer Architecture and Computer Systems, Non-volatile storage, distributed storage, big da
Shihan Dou
Fudan University
LLMs, Code LMs, RL, Alignment
Tao Gui
Fudan NLP Group
Qi Zhang
Fudan University
SAGIN, satellite routing
Xuanjing Huang
Fudan NLP Group