🤖 AI Summary
Existing evaluation methods for deep research agents are inadequate due to the absence of ground-truth answers, the multidimensional nature of output quality, and the tendency of static assessments to reward superficial fluency. To address these limitations, this work proposes DREAM, a framework built on the principle of "capability parity": it deploys a tool-augmented evaluator whose reasoning capacity is aligned with that of the agent under evaluation. DREAM integrates query-agnostic metrics with adaptive, proxy-based indicators, enabling time-aware assessment, factual verification, and systematic probing of reasoning capabilities, and thereby establishes the first dynamic, reference-free, and multidimensional evaluation paradigm for deep research. Experimental results demonstrate that DREAM significantly outperforms existing benchmarks in detecting factual inaccuracies and temporal decay, offering a scalable solution for evaluating complex research-oriented agents.
📝 Abstract
Deep Research Agents generate analyst-grade reports, yet evaluating them remains challenging due to the absence of a single ground truth and the multidimensional nature of research quality. Recent benchmarks propose distinct methodologies, but they suffer from the Mirage of Synthesis, in which strong surface-level fluency and citation alignment can mask underlying factual and reasoning defects. We characterize this gap by introducing a taxonomy across four verticals that exposes a critical capability mismatch: static evaluators inherently lack the tool-use capabilities required to assess temporal validity and factual correctness. To address this, we propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that instantiates the principle of capability parity by making evaluation itself agentic. DREAM structures assessment through an evaluation protocol that combines query-agnostic metrics with adaptive metrics generated by a tool-calling agent, enabling temporally aware coverage, grounded verification, and systematic reasoning probes. Controlled evaluations demonstrate that DREAM is significantly more sensitive to factual errors and temporal decay than existing benchmarks, offering a scalable, reference-free evaluation paradigm.
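To make the capability-parity protocol concrete, here is a minimal, hypothetical Python sketch of the two-layer idea the abstract describes: a static layer of query-agnostic metrics plus an agentic layer that proposes query-specific checks and verifies them with tools. All names (`Check`, `propose_checks`, `run_tool`, `evaluate`) and the stub heuristics are our own illustration, not the paper's published API; in DREAM itself, the adaptive checks would be generated by an LLM with real tool access (e.g., web search) rather than the placeholders used here.

```python
from dataclasses import dataclass

# --- Hypothetical types and functions; DREAM's actual interface is not published here. ---

@dataclass
class Check:
    """One adaptive, query-specific check proposed by the evaluator agent."""
    claim: str          # a factual or temporal claim extracted from the report
    tool: str           # which tool verifies it, e.g. "web_search"
    passed: bool = False

def query_agnostic_metrics(report: str) -> dict[str, float]:
    """Static metrics applicable to any research report.
    These are crude illustrative proxies for citation density and length."""
    sentences = max(report.count("."), 1)
    citations = report.count("[")
    return {
        "citation_density": min(citations / sentences, 1.0),
        "length_adequacy": min(len(report.split()) / 500, 1.0),
    }

def propose_checks(query: str, report: str) -> list[Check]:
    """Stand-in for the tool-calling evaluator agent. In the framework the
    abstract describes, an LLM would read the report and emit claims to
    verify; here a fixed list keeps the sketch self-contained."""
    return [
        Check(claim="Key statistic in the report matches its cited source",
              tool="web_search"),
        Check(claim="Figures reflect the most recent available release",
              tool="web_search"),
    ]

def run_tool(check: Check) -> bool:
    """Stand-in for grounded verification. A real implementation would call
    a search or retrieval API, fetch evidence, and compare it to the claim."""
    return True  # placeholder verdict for demonstration only

def evaluate(query: str, report: str) -> dict[str, float]:
    """Capability-parity evaluation: combine the static metric layer with
    adaptive, tool-grounded checks, then aggregate into one score dict."""
    scores = query_agnostic_metrics(report)
    checks = propose_checks(query, report)
    for c in checks:
        c.passed = run_tool(c)
    scores["grounded_verification"] = (
        sum(c.passed for c in checks) / len(checks) if checks else 1.0
    )
    return scores

if __name__ == "__main__":
    print(evaluate("state of solid-state batteries",
                   "Report text with citations [1] [2]."))
```

The design point of the sketch is the split itself: the static layer can be computed without tools, while the adaptive layer requires the evaluator to act, which is what lets it detect temporal decay and factual errors that a reference-free static judge would miss.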