Benchmarking Deep Search over Heterogeneous Enterprise Data

📅 2025-06-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deep search over heterogeneous enterprise data (e.g., documents, meeting notes, Slack messages, GitHub repositories, URLs) requires source awareness and multi-hop reasoning, yet existing methods fail to retrieve sparse, interlinked evidence comprehensively, severely degrading RAG performance. Method: We introduce the first synthetic benchmark grounded in real business workflows (product planning, development, support), featuring a scalable multi-source data generation pipeline that produces a retrieval corpus of 39,190 artifacts and a multi-hop question set with answer annotations for fine-grained evaluation of long-context LLMs and RAG systems. Contribution/Results: Experiments show that state-of-the-art agent-based RAG achieves only 32.96% average accuracy, confirming incomplete evidence retrieval as the fundamental bottleneck. This work is the first to systematically identify, quantify, and benchmark the evidence-completeness challenge in deep search, providing critical infrastructure and a formal problem definition for future research.

📝 Abstract
We present a new benchmark for evaluating Deep Search, a realistic and complex form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. These include documents, meeting transcripts, Slack messages, GitHub, and URLs, which vary in structure and often contain human-to-human interactions. We build the benchmark using a synthetic data pipeline that simulates business workflows across product planning, development, and support stages, generating interconnected content with realistic noise, along with multi-hop questions that have guaranteed ground-truth answers. We release our benchmark with both answerable and unanswerable queries and a retrieval pool of 39,190 enterprise artifacts, enabling fine-grained evaluation of long-context LLMs and RAG systems. Our experiments reveal that even the best-performing agentic RAG methods achieve an average performance score of only 32.96 on our benchmark. Further analysis highlights retrieval as the main bottleneck: existing methods struggle to conduct deep searches and retrieve all necessary evidence. Consequently, they often reason over partial context, leading to significant performance degradation.
Problem

Research questions and friction points this paper is trying to address.

Evaluating Deep Search in complex RAG systems
Handling multi-hop reasoning over diverse data sources
Addressing retrieval bottlenecks in enterprise data search
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic data pipeline simulates business workflows
Multi-hop questions with guaranteed ground-truth answers
Retrieval pool of 39,190 diverse enterprise artifacts
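The evidence-completeness bottleneck the paper highlights can be made concrete with a small sketch. This is illustrative only, not the paper's evaluation code: the function name, the artifact IDs, and the sample questions below are hypothetical. The idea is that a multi-hop question is fully answerable only when every gold evidence artifact appears in the retrieved set, so partial retrieval caps downstream answer accuracy.

```python
def evidence_recall(retrieved_ids, gold_evidence_ids):
    """Fraction of gold evidence artifacts present in the retrieved set.

    A multi-hop question is only fully answerable when this is 1.0;
    anything less forces the reader model to reason over partial context.
    """
    if not gold_evidence_ids:
        return 1.0  # no required evidence, trivially complete
    retrieved = set(retrieved_ids)
    hits = sum(1 for g in gold_evidence_ids if g in retrieved)
    return hits / len(gold_evidence_ids)


# Hypothetical examples: a 3-hop question missing one GitHub artifact,
# and a single-hop question with complete evidence.
questions = [
    {"retrieved": ["doc_3", "slack_7"], "gold": ["doc_3", "slack_7", "gh_12"]},
    {"retrieved": ["doc_1"], "gold": ["doc_1"]},
]
scores = [evidence_recall(q["retrieved"], q["gold"]) for q in questions]
avg_recall = sum(scores) / len(scores)  # mean evidence completeness
```

Averaging a metric like this across the question set is one way to quantify how far a retriever is from surfacing all the interlinked evidence that the benchmark's multi-hop questions demand.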
Prafulla Kumar Choubey
Salesforce AI Research
Natural Language Processing · Machine Learning
Xiangyu Peng
Salesforce AI Research
Shilpa Bhagavath
Salesforce AI Research
Kung-Hsiang Huang
Salesforce AI Research
Caiming Xiong
Salesforce Research
Machine Learning · NLP · Computer Vision · Multimedia · Data Mining
Chien-Sheng Wu
Salesforce AI Research