DRBench: A Realistic Benchmark for Enterprise Deep Research

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks predominantly target simple tasks or generic web retrieval, failing to assess AI agents’ capabilities in enterprise-grade deep research involving complex, multi-step reasoning across heterogeneous data sources. Method: The authors introduce an enterprise-oriented benchmark for deep research evaluation, spanning ten domains—including sales, cybersecurity, and compliance—that requires joint reasoning over public web content and private enterprise knowledge (e.g., emails, chat logs, cloud documents). They propose a synthesis pipeline with human-in-the-loop verification, leveraging LLMs (GPT, Llama, Qwen) and domain-specific research strategies to generate 15 reproducible tasks. The evaluation framework integrates enterprise context and multimodal data, incorporating expert validation to assess factual accuracy, information recall, and report structuring. Contribution/Results: Empirical analysis exposes critical limitations of state-of-the-art models on industrial-scale research tasks, establishing a standardized benchmark for rigorous AI agent evaluation in enterprise settings.

📝 Abstract
We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. Unlike prior benchmarks that focus on simple questions or web-only queries, DRBench evaluates agents on multi-step queries (for example, "What changes should we make to our product roadmap to ensure compliance with this standard?") that require identifying supporting facts from both the public web and private company knowledge bases. Each task is grounded in realistic user personas and enterprise context, spanning a heterogeneous search space that includes productivity software, cloud file systems, emails, chat conversations, and the open web. Tasks are generated through a carefully designed synthesis pipeline with human-in-the-loop verification, and agents are evaluated on their ability to recall relevant insights, maintain factual accuracy, and produce coherent, well-structured reports. We release 15 deep research tasks across 10 domains, such as Sales, Cybersecurity, and Compliance. We demonstrate the effectiveness of DRBench by evaluating diverse DR agents across open- and closed-source models (such as GPT, Llama, and Qwen) and DR strategies, highlighting their strengths, weaknesses, and the critical path for advancing enterprise deep research. Code is available at https://github.com/ServiceNow/drbench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI agents on complex, open-ended enterprise research tasks
Assessing multi-step queries that span public web and private enterprise data
Measuring information recall, factual accuracy, and report coherence in realistic scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

DRBench benchmark of 15 multi-step enterprise research tasks across 10 domains
Integrates public web content with private knowledge sources (emails, chat logs, cloud documents)
Uses a synthesis pipeline with human-in-the-loop verification for task generation