How Far Are We from Genuinely Useful Deep Research Agents?

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current deep research agents (DRAs) exhibit significant deficiencies in generating analyst-grade comprehensive reports: mainstream evaluations focus narrowly on question-answering tasks, neglecting report generation capabilities, and existing benchmarks suffer from low task complexity and subjective metrics, failing to reflect real-world requirements. To address these gaps, we introduce FINDER, a fine-grained Deep Research benchmark comprising 100 manually curated research tasks and 419 structured evaluation items, and DEFT, the first systematic taxonomy of DRA failure modes. Leveraging human–LLM collaborative annotation and grounded theory analysis, we conduct a multidimensional empirical evaluation of leading DRAs, revealing that their core bottlenecks lie in evidence integration, cross-source verification, and reasoning-resilient planning, not task understanding. FINDER and DEFT establish a reproducible, scalable paradigm for standardized DRA evaluation and capability advancement.

📝 Abstract
Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering benchmarks, while research on generating comprehensive reports remains overlooked. Worse, current benchmarks for report synthesis suffer from low task complexity and subjective metrics -- this fails to reflect user demands and limits the practical utility of generated reports. To address these gaps, we present Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks with 419 structured checklist items that standardize report structure, analytical depth, and factual grounding. Based on approximately 1,000 reports produced by mainstream DRAs, we further propose Deep rEsearch Failure Taxonomy (DEFT), the first failure taxonomy for deep research agents. DEFT contains 14 fine-grained failure modes across reasoning, retrieval, and generation, and is built upon grounded theory with human–LLM co-annotation and inter-annotator reliability validation. Our experimental findings reveal that current DRAs struggle not with task comprehension but with evidence integration, verification, and reasoning-resilient planning.
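The abstract describes scoring reports against structured checklist items. As a minimal sketch of how such checklist-based evaluation might aggregate into a score (the paper's actual rubric, weighting, and item schema are not specified here; all names below are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    """One structured evaluation item (hypothetical schema)."""
    description: str   # what the report must satisfy
    satisfied: bool    # verdict from a human or LLM annotator

def score_report(items: list[ChecklistItem]) -> float:
    """Fraction of checklist items the report satisfies (simple unweighted aggregation)."""
    if not items:
        return 0.0
    return sum(item.satisfied for item in items) / len(items)

# Example: three items covering structure, grounding, and analysis
items = [
    ChecklistItem("Report states the research question explicitly", True),
    ChecklistItem("Every quantitative claim cites a retrieved source", False),
    ChecklistItem("Conclusions follow from the cited evidence", True),
]
print(f"{score_report(items):.2f}")  # 0.67
```

A real benchmark of this kind would likely weight items, require inter-annotator agreement per verdict, and break scores out by dimension (structure, depth, grounding) rather than averaging uniformly.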
Problem

Research questions and friction points this paper is trying to address.

Benchmarks lack realistic report synthesis tasks and objective metrics
Existing Deep Research Agents fail in evidence integration and verification
Need standardized evaluation for comprehensive research report generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces FINDER benchmark with structured checklist items
Proposes DEFT taxonomy for deep research agent failures
Identifies evidence integration and verification as key challenges
👥 Authors
Dingling Zhang (OPPO AI Agent Team)
He Zhu (OPPO AI Agent Team)
Jincheng Ren (Hohai University)
Kangqi Song (OPPO AI Agent Team)
Xinran Zhou (OPPO AI Agent Team)
Boyu Feng (OPPO AI Agent Team)
Shudong Liu (University of Macau)
Jiabin Luo (OPPO AI Agent Team)
Weihao Xie (OPPO AI Agent Team)
Zhaohui Wang (OPPO AI Agent Team)
Tianrui Qin (OPPO)
King Zhu (OPPO AI Agent Team)
Yuqing Wang (OPPO AI Agent Team)
Qianben Chen (OPPO AI Agent Team)
Yuchen Eleanor Jiang (OPPO)
Wei Wang (OPPO AI Agent Team)
Jiaheng Liu (OPPO AI Agent Team)
Wangchunshu Zhou (OPPO & M-A-P)