Characterizing Deep Research: A Benchmark and Formal Definition

📅 2025-08-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Deep research tasks have long lacked a formal definition and objective, reproducible evaluation benchmarks. Method: This work formally characterizes deep research by identifying high fan-out conceptual search as its core mechanism, decouples reasoning from report generation via an intermediate output representation to make evaluation more objective, and introduces LiveDRBench, a benchmark of 100 challenging tasks spanning scientific topics and public-interest events. Evaluation uses F1 score together with fine-grained analysis of reasoning trajectories, quantifying source citation quality, branching breadth, and backtracking behavior. Results: Existing systems achieve only 0.02–0.72 F1 across sub-categories, with OpenAI's model attaining the highest overall F1 (0.55), revealing critical deficiencies in search depth and structured reasoning. The work provides a formal foundation and an empirical toolkit for modeling, evaluating, and improving deep research systems.
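As a rough illustration of the trajectory analysis mentioned above, the sketch below counts citation, branching, and backtracking events in a reasoning trace. The trace format and event labels (`cite`, `branch`, `backtrack`) are assumptions made for illustration, not LiveDRBench's actual schema.

```python
from collections import Counter

def summarize_trace(events):
    """Count citation, branching, and backtracking events in a reasoning trace.

    `events` is assumed to be a list of dicts like {"type": "cite", "source": ...};
    the event types used here are illustrative, not the benchmark's real schema.
    """
    counts = Counter(e["type"] for e in events)
    return {
        "sources_cited": len({e["source"] for e in events if e["type"] == "cite"}),
        "branching_events": counts["branch"],        # new conceptual directions explored
        "backtracking_events": counts["backtrack"],  # directions abandoned and revisited
    }

# Toy trace: two branches, one backtrack, two distinct cited sources.
trace = [
    {"type": "branch"}, {"type": "cite", "source": "arxiv:1234"},
    {"type": "backtrack"}, {"type": "branch"},
    {"type": "cite", "source": "doi:10.5555/xyz"},
]
print(summarize_trace(trace))
```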

📝 Abstract
Information tasks such as writing surveys or analytical reports require complex search and reasoning, and have recently been grouped under the umbrella of deep research -- a term also adopted by recent models targeting these capabilities. Despite growing interest, the scope of the deep research task remains underdefined and its distinction from other reasoning-intensive problems is poorly understood. In this paper, we propose a formal characterization of the deep research (DR) task and introduce a benchmark to evaluate the performance of DR systems. We argue that the core defining feature of deep research is not the production of lengthy report-style outputs, but rather the high fan-out over concepts required during the search process, i.e., broad and reasoning-intensive exploration. To enable objective evaluation, we define DR using an intermediate output representation that encodes key claims uncovered during search -- separating the reasoning challenge from surface-level report generation. Based on this formulation, we propose LiveDRBench, a diverse and challenging benchmark of 100 tasks over scientific topics (e.g., datasets, materials discovery, prior art search) and public interest events (e.g., flight incidents, movie awards). Across state-of-the-art DR systems, F1 scores range between 0.02 and 0.72 across sub-categories. OpenAI's model performs the best with an overall F1 score of 0.55. Analysis of reasoning traces reveals the distribution over the number of referenced sources, branching, and backtracking events executed by current DR systems, motivating future directions for improving their search mechanisms and grounding capabilities. The benchmark is available at https://github.com/microsoft/LiveDRBench.
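Since systems are scored with an F1 over the key claims encoded in the intermediate output representation, here is a minimal sketch of set-based precision, recall, and F1 between predicted and reference claims. The matching criterion shown (normalized string equality) is an assumption for illustration; the paper's actual matching procedure may differ.

```python
def claim_f1(predicted, reference):
    """Set-based F1 between predicted and reference key claims.

    Claims are compared by normalized string equality here; this is an
    illustrative assumption, not LiveDRBench's actual matching criterion.
    """
    norm = lambda s: " ".join(s.lower().split())
    pred = {norm(c) for c in predicted}
    ref = {norm(c) for c in reference}
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# Example: one matched claim, one missed, one spurious -> F1 = 0.5
print(claim_f1(
    ["Flight AB123 diverted to Denver", "The incident occurred in 2023"],
    ["flight ab123 diverted to denver", "Two crew members were injured"],
))
```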
Problem

Research questions and friction points this paper is trying to address.

Defining deep research task scope and distinctions
Creating benchmark for deep research evaluation
Improving search mechanisms in deep research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formal definition of deep research task
Intermediate output representation for evaluation
Diverse benchmark LiveDRBench for testing
👥 Authors
Abhinav Java
Microsoft Research
Ashmit Khandelwal
Microsoft Research, Bengaluru, India
Sukruta Midigeshi
Microsoft Research, Bengaluru, India
Aaron Halfaker
Microsoft
Amit Deshpande
Microsoft Research
Navin Goyal
Microsoft Research
Ankur Gupta
Microsoft, Redmond, USA
Nagarajan Natarajan
Microsoft Research India
Amit Sharma
Microsoft Research, Bengaluru, India