🤖 AI Summary
This work addresses the lack of effective evaluation of deep research agents' ability to retrieve key literature and construct expert-level knowledge structures during automated survey generation. To this end, the authors introduce TaxoBench, a benchmark built on 3,815 expert-annotated core references and their corresponding taxonomy trees extracted from 72 highly cited computer science surveys. For the first time, expert-curated taxonomies serve as the gold standard for systematically evaluating the end-to-end retrieval and structuring performance of seven deep research agents and twelve state-of-the-art large language models under both Deep Research and Bottom-Up paradigms. Experimental results reveal that even the best-performing system recalls only 20.92% of expert-selected papers and reaches at most a 0.31 Adjusted Rand Index (ARI) in structural organization, far below expert quality, highlighting a significant gap in both current evaluation frameworks and system capabilities.
📝 Abstract
Deep Research Agents increasingly automate survey generation, yet it remains unclear whether they match human experts in two core abilities: retrieving essential papers and organizing them into expert-like taxonomies. Existing benchmarks emphasize writing quality or citation correctness, while standard clustering metrics fail to capture hierarchical taxonomy structure. We introduce TaxoBench, a benchmark built from 72 highly-cited LLM surveys containing expert-authored taxonomy trees, with 3,815 papers mapped to expert-defined categories as ground truth. TaxoBench evaluates both abilities: (1) retrieval, measuring whether agents retrieve expert-cited papers; and (2) organization, assessed at two levels: the leaf level measures paper-to-category assignment, while the hierarchy level measures taxonomy structure via novel metrics, Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path). TaxoBench supports two evaluation modes: Deep Research tests end-to-end capability given only a topic, while Bottom-Up provides the expert paper set to isolate organization ability. Evaluating 7 Deep Research Agents and 12 frontier LLMs reveals a dual bottleneck: the best agent retrieves only 20.92% of expert-cited papers, and even with perfect input, the best model achieves only 31.24% ARI, with substantial structural gaps. Our benchmark is publicly available at https://github.com/KongLongGeFDU/TaxoBench.
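The leaf-level organization score reported above is the standard Adjusted Rand Index between the expert's paper-to-category assignment and the agent's. As context for that metric (this is a generic from-scratch sketch of ARI, not the benchmark's own evaluation code; the function name and example labels are illustrative):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(expert, predicted):
    """ARI between two flat labelings of the same set of papers.

    1.0 = identical partitions; ~0 = chance-level agreement.
    """
    n = len(expert)
    # Contingency counts: how many papers share each (expert, predicted) label pair.
    contingency = Counter(zip(expert, predicted))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(expert).values())
    sum_b = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = sum_a * sum_b / comb(n, 2)   # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_ij - expected) / (max_index - expected)
```

Note that ARI is invariant to label names, so an agent is not penalized for calling a category "RAG" where the expert wrote "Retrieval-Augmented Generation"; only the grouping of papers matters. This is also why ARI alone cannot capture the hierarchy-level structure that US-TED and Sem-Path target.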