🤖 AI Summary
This work addresses the lack of effective evaluation of deep research agents' ability to retrieve key literature and construct expert-level knowledge structures during automated survey generation. To this end, the authors introduce TaxoBench, a benchmark built on 3,815 expert-annotated core references and their corresponding taxonomy trees extracted from 72 highly cited computer science surveys. For the first time, expert-curated taxonomies serve as the gold standard for systematically evaluating the end-to-end retrieval and structuring performance of seven deep research agents and twelve state-of-the-art large language models under both Deep Research and Bottom-Up paradigms. Experimental results reveal that even the best-performing system recalls only 20.92% of expert-selected papers and reaches at most a 0.31 Adjusted Rand Index (ARI) in structural organization, far below expert quality, highlighting a significant gap in both current evaluation frameworks and system capabilities.
📝 Abstract
Deep Research Agents increasingly automate survey generation, yet it remains unclear whether they match human experts in two core abilities: retrieving essential papers and organizing them into expert-like taxonomies. Existing benchmarks emphasize writing quality or citation correctness, while standard clustering metrics fail to capture hierarchical taxonomy structure. We introduce TaxoBench, a benchmark built from 72 highly-cited LLM surveys containing expert-authored taxonomy trees, with 3,815 papers mapped to expert-defined categories as ground truth. TaxoBench evaluates both abilities: (1) retrieval, measuring whether agents retrieve expert-cited papers; and (2) organization, assessed at two levels: the leaf level measures paper-to-category assignment, while the hierarchy level measures taxonomy structure via novel metrics, Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path). TaxoBench supports two evaluation modes: Deep Research tests end-to-end capability given only a topic, while Bottom-Up provides the expert paper set to isolate organization ability. Evaluating 7 Deep Research Agents and 12 frontier LLMs reveals a dual bottleneck: the best agent retrieves only 20.92% of expert-cited papers, and even with perfect input, the best model achieves only 31.24% ARI, with substantial structural gaps. Our benchmark is publicly available at https://github.com/KongLongGeFDU/TaxoBench.
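The leaf-level organization score reported above is the standard Adjusted Rand Index between the expert's paper-to-category assignment and the agent's. As context for that metric (this is a generic from-scratch sketch of ARI, not the benchmark's own evaluation code; the function name and example labels are illustrative):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(expert, predicted):
    """ARI between two flat labelings of the same set of papers.

    1.0 = identical partitions; ~0 = chance-level agreement.
    """
    n = len(expert)
    # Contingency counts: how many papers share each (expert, predicted) label pair.
    contingency = Counter(zip(expert, predicted))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(expert).values())
    sum_b = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = sum_a * sum_b / comb(n, 2)   # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_ij - expected) / (max_index - expected)
```

Note that ARI is invariant to label names, so an agent is not penalized for calling a category "RAG" where the expert wrote "Retrieval-Augmented Generation"; only the grouping of papers matters. This is also why ARI alone cannot capture the hierarchy-level structure that US-TED and Sem-Path target.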