🤖 AI Summary
Existing benchmarks inadequately evaluate Deep Research Agents (DRAs) on complex, open-ended tasks due to narrow evaluation dimensions, mismatched output formats, and coarse-grained scoring mechanisms.
Method: We introduce DRA-Bench, the first rigorous DRA-specific benchmark, comprising 214 expert-crafted questions spanning ten thematic domains and explicitly designed for long-form, report-style outputs. We also propose the first multidimensional automated evaluation framework for such outputs, quantifying performance along three axes: semantic quality, topical focus, and retrieval trustworthiness. Each query is paired with a manually constructed reference bundle, enabling fine-grained, end-to-end assessment of the long-form reports that DRAs produce through cross-source retrieval and multi-stage reasoning.
Contribution/Results: Experiments show that state-of-the-art DRAs substantially outperform reasoning models augmented with web-search tools, yet still exhibit notable deficiencies in information integration and logical consistency. DRA-Bench thus establishes a reliable, diagnostic foundation for evaluating and advancing DRA capabilities.
📝 Abstract
Artificial intelligence is undergoing a paradigm shift from closed language models to interconnected agent systems capable of external perception and information integration. As a representative embodiment, Deep Research Agents (DRAs) exhibit systematic capabilities for task decomposition, cross-source retrieval, multi-stage reasoning, and structured output, which markedly enhance performance on complex and open-ended tasks. However, existing benchmarks remain deficient in evaluation dimensions, response formatting, and scoring mechanisms, limiting their capacity to assess such systems effectively. This paper introduces a rigorous benchmark and a multidimensional evaluation framework tailored to DRAs and report-style responses. The benchmark comprises 214 expert-curated challenging queries distributed across 10 broad thematic domains, each accompanied by manually constructed reference bundles to support composite evaluation. The framework enables comprehensive evaluation of the long-form reports generated by DRAs, incorporating integrated scoring metrics for semantic quality, topical focus, and retrieval trustworthiness. Extensive experimentation confirms the superior performance of mainstream DRAs over web-search-tool-augmented reasoning models, yet reveals considerable scope for further improvement. This study provides a robust foundation for capability assessment, architectural refinement, and paradigm advancement in DRA systems.
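To make the three-axis evaluation concrete, below is a minimal, hypothetical sketch of how per-dimension scores for a generated report could be aggregated into a single composite score. The dimension names follow the abstract, but the `DimensionScores` structure, the weighted-average aggregation, and the weights themselves are illustrative assumptions, not the paper's actual metric definitions.

```python
from dataclasses import dataclass

@dataclass
class DimensionScores:
    """Per-dimension scores for one generated report, each assumed normalized to [0, 1]."""
    semantic_quality: float            # e.g., coverage and coherence against the reference bundle
    topical_focus: float               # how tightly the report stays on the queried topic
    retrieval_trustworthiness: float   # credibility/verifiability of the cited sources

def composite_score(scores: DimensionScores,
                    weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Aggregate the three axes into one score via a weighted average.

    The weights are placeholder values chosen for illustration only.
    """
    w_sem, w_top, w_ret = weights
    total = w_sem + w_top + w_ret
    return (w_sem * scores.semantic_quality
            + w_top * scores.topical_focus
            + w_ret * scores.retrieval_trustworthiness) / total

# Example: score one hypothetical report
report = DimensionScores(semantic_quality=0.72, topical_focus=0.85, retrieval_trustworthiness=0.60)
print(f"Composite score: {composite_score(report):.3f}")
```

A weighted average is only one possible aggregation; the actual framework may combine its integrated metrics differently (e.g., reporting the three axes separately rather than collapsing them into one number).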