DREAM: Deep Research Evaluation with Agentic Metrics

📅 2026-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation methods for deep research agents are inadequate due to the absence of ground-truth answers, the multidimensional nature of output quality, and the susceptibility of static assessments to superficial fluency. To address these limitations, this work proposes DREAM, a novel framework that introduces the principle of "capability parity" by deploying a tool-augmented evaluator capable of aligning its reasoning capacity with that of the agent under evaluation. DREAM integrates query-agnostic metrics with adaptive proxy-based indicators, enabling time-aware assessment, factual verification, and systematic probing of reasoning capabilities—establishing the first dynamic, reference-free, and multidimensional evaluation paradigm for deep research. Experimental results demonstrate that DREAM significantly outperforms existing benchmarks in detecting factual inaccuracies and temporal decay, offering a scalable solution for evaluating complex research-oriented agents.

📝 Abstract
Deep Research Agents generate analyst-grade reports, yet evaluating them remains challenging due to the absence of a single ground truth and the multidimensional nature of research quality. Recent benchmarks propose distinct methodologies, yet they suffer from the Mirage of Synthesis, where strong surface-level fluency and citation alignment can obscure underlying factual and reasoning defects. We characterize this gap by introducing a taxonomy across four verticals that exposes a critical capability mismatch: static evaluators inherently lack the tool-use capabilities required to assess temporal validity and factual correctness. To address this, we propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that instantiates the principle of capability parity by making evaluation itself agentic. DREAM structures assessment through an evaluation protocol combining query-agnostic metrics with adaptive metrics generated by a tool-calling agent, enabling temporally aware coverage, grounded verification, and systematic reasoning probes. Controlled evaluations demonstrate DREAM is significantly more sensitive to factual and temporal decay than existing benchmarks, offering a scalable, reference-free evaluation paradigm.
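The two-tier protocol the abstract describes — fixed query-agnostic metrics combined with adaptive metrics generated by a tool-calling evaluator — could be sketched roughly as follows. All names here are hypothetical illustrations, and the keyword-lookup "tool" is a stand-in for real retrieval; this is not the paper's actual implementation.

```python
# Hypothetical sketch of a DREAM-style agentic evaluation loop.
# Tier 1: query-agnostic metrics applied to every report.
# Tier 2: adaptive metrics derived from a tool call (stubbed here).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Metric:
    name: str
    score_fn: Callable[[str], float]  # report text -> score in [0, 1]

def query_agnostic_metrics() -> list[Metric]:
    # Fixed checks, e.g. a crude proxy for citation presence and non-trivial length.
    return [
        Metric("cites_sources", lambda r: 1.0 if "[" in r and "]" in r else 0.0),
        Metric("nonempty", lambda r: 1.0 if len(r.split()) > 20 else 0.0),
    ]

def adaptive_metrics(query: str, tool: Callable[[str], str]) -> list[Metric]:
    # In the framework, a tool-calling evaluator would generate query-specific
    # probes (temporal validity, fact checks). We stub the tool for illustration.
    evidence = tool(query)
    return [
        Metric("mentions_evidence",
               lambda r, e=evidence: 1.0 if e.lower() in r.lower() else 0.0),
    ]

def evaluate(report: str, query: str,
             tool: Callable[[str], str]) -> dict[str, float]:
    # Merge both tiers and score the report against each metric.
    metrics = query_agnostic_metrics() + adaptive_metrics(query, tool)
    return {m.name: m.score_fn(report) for m in metrics}
```

The key design point this sketch tries to convey is "capability parity": the adaptive tier is parameterized by the same kind of tool access the agent under evaluation has, so the evaluator is not limited to static, surface-level checks.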
Problem

Research questions and friction points this paper is trying to address.

Deep Research Evaluation
Agentic Metrics
Factual Correctness
Temporal Validity
Research Quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Evaluation
Capability Parity
Tool-Augmented Reasoning
Temporal Validity
Reference-Free Benchmarking