🤖 AI Summary
Existing benchmarks inadequately assess the holistic capabilities of deep research agents in open-domain scientific inquiry, particularly multi-source information fusion, insight generation, and report synthesis. To address this gap, we propose DeepResearch-ReportEval, an evaluation framework built on an LLM-as-a-Judge paradigm that quantifies performance along three dimensions: quality, redundancy, and factual consistency. The framework also contributes a standardized benchmark of 100 real-world scientific queries spanning 12 categories. These queries require evaluated agents to orchestrate multiple tools and fuse information across sources, while the framework's human-calibrated judging achieves high agreement with domain experts (Spearman's ρ > 0.92). Using this framework, we conduct the first systematic evaluation of four leading commercial research agents, uncovering critical design trade-offs and capability disparities. All code and data are publicly released to foster reproducible research.
📝 Abstract
DeepResearch agents represent a transformative AI paradigm, conducting expert-level research through sophisticated reasoning and multi-tool integration. However, evaluating these systems remains critically challenging: research scenarios are open-ended, and existing benchmarks focus on isolated capabilities rather than holistic performance. Unlike traditional LLM tasks, DeepResearch systems must synthesize diverse sources, generate insights, and present coherent findings, capabilities that resist simple verification. To address this gap, we introduce DeepResearch-ReportEval, a comprehensive framework designed to assess DeepResearch systems through their most representative outputs: research reports. Our approach systematically measures three dimensions, quality, redundancy, and factuality, using an LLM-as-a-Judge methodology that achieves strong concordance with expert judgments. We contribute a standardized benchmark of 100 curated queries spanning 12 real-world categories, enabling systematic capability comparison. Our evaluation of four leading commercial systems reveals distinct design philosophies and performance trade-offs, establishing foundational insights as DeepResearch evolves from information assistants toward intelligent research partners. Source code and data are available at: https://github.com/HKUDS/DeepResearch-Eval.
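To make the LLM-as-a-Judge setup concrete, the following is a minimal sketch of scoring one research report along the three dimensions named above. The rubric wording, the 1–5 scale, and the `judge` callable are illustrative assumptions, not the paper's actual prompts or scoring protocol; a stub stands in for the real LLM call.

```python
# Minimal LLM-as-a-Judge sketch: score a report on quality, redundancy,
# and factuality. Rubric text and the 1-5 scale are hypothetical.
from statistics import mean
from typing import Callable, Dict

# Hypothetical rubric prompts, one per evaluation dimension.
RUBRICS: Dict[str, str] = {
    "quality": "Rate the overall quality of this research report (1-5).",
    "redundancy": "Rate how free of repetition this report is (1-5).",
    "factuality": "Rate the report's factual consistency (1-5).",
}

def evaluate_report(report: str,
                    judge: Callable[[str, str], float]) -> Dict[str, float]:
    """Score one report on each rubric dimension via the judge model,
    then add a simple mean as an overall score."""
    scores = {dim: judge(prompt, report) for dim, prompt in RUBRICS.items()}
    scores["overall"] = mean(scores[dim] for dim in RUBRICS)
    return scores

# Stub judge standing in for a real LLM API call.
def stub_judge(prompt: str, report: str) -> float:
    return 4.0

print(evaluate_report("Example research report text ...", stub_judge))
```

In a real pipeline the `judge` callable would wrap an LLM API request and parse a numeric score from its response; calibrating those scores against human expert ratings is what the framework's reported expert concordance refers to.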