DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation

📅 2026-04-16
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the difficulty of evaluating deep research agents on complex, long-horizon research tasks, where dynamic web environments and ambiguous task definitions hinder reliable assessment. It introduces DR³-Eval, a benchmark for multimodal, multi-file report generation that balances realism with reproducibility: tasks are built from authentic user-provided materials, and each task is paired with a static research sandbox corpus containing supportive documents, distractors, and noise. A multi-dimensional automatic evaluation framework, validated against human judgments, measures Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality. Experiments with the authors' multi-agent system DR³-Agent, run on multiple state-of-the-art language models, show that the benchmark is highly challenging and expose critical failure modes in retrieval robustness and hallucination control. The code and data are publicly available.

📝 Abstract
Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR$^{3}$-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR$^{3}$-Eval is constructed from authentic user-provided materials and paired with a per-task static research sandbox corpus that simulates open-web complexity while remaining fully verifiable, containing supportive documents, distractors, and noise. Moreover, we introduce a multi-dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and validate its alignment with human judgments. Experiments with our developed multi-agent system DR$^{3}$-Agent based on multiple state-of-the-art language models demonstrate that DR$^{3}$-Eval is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.
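The abstract describes a per-task sandbox corpus that mixes supportive documents, distractors, and noise. As a rough illustration of why such a labeled corpus makes retrieval behavior verifiable, the sketch below tallies which kinds of documents a report cites. Everything in it is an assumption made for illustration: the `SandboxDoc` type, the role labels, and the `citation_breakdown` function are hypothetical names, and the computation is not the paper's Citation Coverage metric or any other metric it defines.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Hypothetical role labels, named after the three document kinds in the abstract.
ROLES = ("supportive", "distractor", "noise")


@dataclass(frozen=True)
class SandboxDoc:
    """One document in a per-task static sandbox corpus (illustrative type)."""
    doc_id: str
    role: str  # expected to be one of ROLES


def citation_breakdown(corpus: List[SandboxDoc], cited_ids: Set[str]) -> Dict[str, float]:
    """For each role, return the fraction of that role's documents the report cited.

    Assumption: this is an illustrative diagnostic, not a metric from the paper.
    """
    breakdown: Dict[str, float] = {}
    for role in ROLES:
        role_ids = {doc.doc_id for doc in corpus if doc.role == role}
        # Guard against an empty role so the division is always defined.
        breakdown[role] = len(role_ids & cited_ids) / len(role_ids) if role_ids else 0.0
    return breakdown


if __name__ == "__main__":
    corpus = [
        SandboxDoc("s1", "supportive"),
        SandboxDoc("s2", "supportive"),
        SandboxDoc("d1", "distractor"),
        SandboxDoc("n1", "noise"),
    ]
    # A report that cites one supportive document and one distractor.
    print(citation_breakdown(corpus, {"s1", "d1"}))
```

Because the corpus is static and every document carries a known role, the same report always yields the same breakdown, which is the reproducibility property the abstract emphasizes. A high fraction for distractors or noise would hint at the retrieval-robustness failures the abstract mentions.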
Problem

Research questions and friction points this paper is trying to address.

Deep Research Agents
Evaluation Benchmark
Multimodal Report Generation
Reproducibility
Realistic Assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep Research Agents
Realistic Evaluation
Reproducible Benchmark
Multimodal Report Generation
Multi-dimensional Evaluation
Qianqian Xie
Wuhan University
NLP, LLM
Qingheng Xiong
Nanjing University
He Zhu
OPPO; M-A-P
AI Agent, Graph Neural Networks, LLMs, Code Intelligence
Tiantian Xia
Nanjing University of Science and Technology
Xueming Han
Jiutian Research
Fanyu Meng
Jiutian Research
Jiakai Wang
Zhongguancun Laboratory
Adversarial examples, Trustworthy AI
Zhiqi Bai
M-A-P
Chengkang Jiang
Nanjing University
Zhaohui Wang
Nanjing University
Yubin Guo
Nanjing University
Yuqing Wen
National University of Singapore
Jiayang Mao
Nanjing University
Zijie Zhang
Assistant Professor, University of Texas at San Antonio
Trustworthy Machine Learning, Adversarial A/D, Federated Learning, Graph
Shihao Li
Nanjing University
Yanghai Wang
Nanjing University
Yuxiang Ren
Tenure-track Assistant Professor, Nanjing University
Graph Neural Network, AI for Science, Foundation Model
Junlan Feng
Chief Scientist at China Mobile Research
Natural Language, Machine Learning, Speech Processing, Data Mining
Jiaheng Liu
Nanjing University