🤖 AI Summary
This work addresses a limitation of current evaluation practices for Deep Research Agents (DRAs): they rely predominantly on end-to-end outcomes and fail to capture intermediate hallucinations, such as flawed planning, that accumulate throughout the research trajectory. To overcome this, the study proposes a process-aware evaluation paradigm that audits complete research trajectories and introduces a fine-grained hallucination assessment framework. It presents the PIES taxonomy, which characterizes hallucinations along two dimensions: functional components (planning vs. summarization) and error properties (explicit vs. implicit). The authors also construct DeepHalluBench, a benchmark of 100 hallucination-prone tasks, including adversarial scenarios. Experiments show that none of the state-of-the-art DRAs evaluated achieves robust reliability, and the diagnostic analysis traces these failures to hallucination propagation and cognitive biases, offering insights for architectural improvement.
📝 Abstract
Diagnosing the failure mechanisms of Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-end evaluation, obscuring critical intermediate hallucinations, such as flawed planning, that accumulate throughout the research trajectory. To bridge this gap, we propose a shift from outcome-based to process-aware evaluation by auditing the full research trajectory. We introduce the PIES Taxonomy to categorize hallucinations along functional components (Planning vs. Summarization) and error properties (Explicit vs. Implicit). We instantiate this taxonomy into a fine-grained evaluation framework that decomposes the trajectory to rigorously quantify these hallucinations. Leveraging this framework to isolate 100 distinctively hallucination-prone tasks including adversarial scenarios, we curate DeepHalluBench. Experiments on six state-of-the-art DRAs reveal that no system achieves robust reliability. Furthermore, our diagnostic analysis traces the etiology of these failures to systemic deficits, specifically hallucination propagation and cognitive biases, providing foundational insights to guide future architectural optimization. Data and code are available at https://github.com/yuhao-zhan/DeepHalluBench.
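To make the two-dimensional structure of the PIES Taxonomy concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the taxonomy can be read as the cross product of the two dimensions named in the abstract, giving four cells. The names `Component`, `ErrorProperty`, `HallucinationLabel`, `step_index`, and `count_by_cell` are hypothetical and do not come from the paper or its repository.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Component(Enum):
    """Functional component in which a hallucination occurs."""
    PLANNING = "planning"
    SUMMARIZATION = "summarization"


class ErrorProperty(Enum):
    """Error property of the hallucination."""
    EXPLICIT = "explicit"
    IMPLICIT = "implicit"


@dataclass(frozen=True)
class HallucinationLabel:
    """One labeled hallucination in a trajectory (hypothetical schema)."""
    step_index: int  # assumed: position of the step within the trajectory
    component: Component
    error_property: ErrorProperty


def taxonomy_cells():
    """Return every (component, error property) pair: the four cells."""
    return list(product(Component, ErrorProperty))


def count_by_cell(labels):
    """Count labeled hallucinations per cell, including empty cells."""
    counts = {cell: 0 for cell in taxonomy_cells()}
    for label in labels:
        counts[(label.component, label.error_property)] += 1
    return counts


if __name__ == "__main__":
    # Made-up labels for illustration only, not data from the benchmark.
    example = [
        HallucinationLabel(0, Component.PLANNING, ErrorProperty.EXPLICIT),
        HallucinationLabel(3, Component.SUMMARIZATION, ErrorProperty.IMPLICIT),
    ]
    for (component, prop), n in count_by_cell(example).items():
        print(f"{component.value:>13} / {prop.value:<8} {n}")
```

Counting per cell over a whole trajectory, rather than scoring only the final report, is the kind of process-level view the abstract contrasts with end-to-end evaluation; how the actual framework decomposes trajectories and assigns labels is not specified here.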