🤖 AI Summary
Evaluating large language model (LLM)-driven deep research tools, which can browse the web, extract information, and generate multi-page reports, remains challenging for knowledge-intensive tasks such as academic literature review, because existing benchmarks fall short on domain coverage, factual accuracy, and linguistic quality.
Method: We propose the first systematic, multidimensional evaluation framework tailored specifically for deep research tools, integrating human judgment with automated metrics across four dimensions: retrieval breadth, information fidelity, logical coherence, and scholarly rigor.
Contribution/Results: Applying this framework to benchmark OpenAI’s and Google’s Deep Research tools on academic review generation, we identify critical capability gaps between them. Our evaluation reveals that fine-grained, task-aligned assessment significantly improves diagnostic precision, thereby guiding iterative tool development and establishing a foundation for future benchmarking standards in AI-augmented research.
📝 Abstract
Large Language Models (LLMs) equipped with agentic capabilities can perform knowledge-intensive tasks without human involvement. A prime example is Deep Research, a tool that can browse the web, extract information, and generate multi-page reports. In this work, we introduce an evaluation sheet for assessing the capabilities of Deep Research tools. In addition, we selected academic survey writing as a use case and evaluated the output reports against this sheet. Our findings show the need for carefully crafted evaluation standards. Evaluating OpenAI's Deep Research and Google's Deep Research on academic survey generation revealed a large gap between ordinary search engines and standalone Deep Research tools, as well as shortcomings in how well the generated surveys represent the targeted area.
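As a concrete illustration of how such an evaluation sheet might be operationalized, here is a minimal sketch that blends human rubric scores with automated metrics across the four dimensions named in the summary above. The dimension weights, the 1-5 scale, the `human_weight` parameter, and all function and variable names are illustrative assumptions, not the paper's actual evaluation sheet.

```python
from dataclasses import dataclass

# Hypothetical weights per evaluation dimension; the paper's actual
# sheet may weight or score these dimensions differently.
DIMENSION_WEIGHTS = {
    "retrieval_breadth": 0.25,
    "information_fidelity": 0.30,
    "logical_coherence": 0.20,
    "scholarly_rigor": 0.25,
}

@dataclass
class DimensionScore:
    human: float      # rubric score from a human judge, on a 1-5 scale
    automated: float  # automated metric mapped onto the same 1-5 scale

def combined_score(scores: dict[str, DimensionScore],
                   human_weight: float = 0.5) -> float:
    """Blend human and automated scores within each dimension, then
    take a weighted average across dimensions (weights are assumed)."""
    total = 0.0
    for dim, weight in DIMENSION_WEIGHTS.items():
        s = scores[dim]
        blended = human_weight * s.human + (1 - human_weight) * s.automated
        total += weight * blended
    return total

# Example: scoring one generated survey report.
report_scores = {
    "retrieval_breadth":    DimensionScore(human=3.0, automated=3.5),
    "information_fidelity": DimensionScore(human=4.0, automated=3.8),
    "logical_coherence":    DimensionScore(human=4.5, automated=4.0),
    "scholarly_rigor":      DimensionScore(human=2.5, automated=3.0),
}
print(f"overall: {combined_score(report_scores):.2f} / 5")
```

Keeping per-dimension scores separate before aggregating supports the diagnostic precision the summary emphasizes: a weak scholarly-rigor score remains visible instead of being washed out by a single overall number.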