🤖 AI Summary
Existing benchmarks predominantly target simple tasks or generic web retrieval, failing to assess AI agents’ capabilities in enterprise-grade deep research involving complex, multi-step reasoning across heterogeneous data sources.
Method: We introduce the first enterprise-oriented benchmark for deep research evaluation, spanning ten domains (including sales, cybersecurity, and compliance) that require joint reasoning over public web content and private enterprise knowledge (e.g., emails, chat logs, cloud documents). Tasks are generated by a synthesis pipeline with human-in-the-loop verification and domain-specific research strategies, yielding 15 reproducible tasks. Each task is grounded in realistic user personas and enterprise context spanning a heterogeneous search space, and agents are assessed on information recall, factual accuracy, and report structuring.
Contribution/Results: Empirical evaluation of diverse deep research agents, built on open- and closed-source models (GPT, Llama, Qwen), exposes critical limitations of state-of-the-art systems on enterprise-scale research tasks, establishing a standardized benchmark for rigorous AI agent evaluation in enterprise settings.
📝 Abstract
We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. Unlike prior benchmarks that focus on simple questions or web-only queries, DRBench evaluates agents on multi-step queries (for example, "What changes should we make to our product roadmap to ensure compliance with this standard?") that require identifying supporting facts from both the public web and private company knowledge bases. Each task is grounded in realistic user personas and enterprise context, spanning a heterogeneous search space that includes productivity software, cloud file systems, emails, chat conversations, and the open web. Tasks are generated through a carefully designed synthesis pipeline with human-in-the-loop verification, and agents are evaluated on their ability to recall relevant insights, maintain factual accuracy, and produce coherent, well-structured reports. We release 15 deep research tasks across 10 domains, such as Sales, Cybersecurity, and Compliance. We demonstrate the effectiveness of DRBench by evaluating diverse DR agents across open- and closed-source models (such as GPT, Llama, and Qwen) and DR strategies, highlighting their strengths, weaknesses, and the critical path for advancing enterprise deep research. Code is available at https://github.com/ServiceNow/drbench.