Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multi-hop deep-search evaluation of RAG systems and web agents suffers from two critical flaws: (1) question text that leaks the reasoning path, inducing reliance on superficial cues; and (2) aggregate pass-rate metrics that obscure root causes, whether insufficient search depth, poor knowledge utilisation, or inappropriate refusal behaviour. This paper introduces WebDetective, a hint-free, sandboxed benchmark for controllable multi-hop search evaluation. It pairs hint-free question design with a factorised evaluation framework that disentangles search sufficiency, knowledge utilisation, and principled refusal behaviour. Integrated validation loops and evidence tracing enable fine-grained diagnostic analysis. Evaluation of 25 state-of-the-art models reveals widespread deficits in knowledge utilisation and refusal competence. The proposed EvidenceLoop baseline substantially improves both search depth and overall answer quality, demonstrating the efficacy of structured evidence grounding and iterative verification.

📝 Abstract
RAG (Retrieval-Augmented Generation) systems and web agents are increasingly evaluated on multi-hop deep search tasks, yet current practice suffers from two major limitations. First, most benchmarks leak the reasoning path in the question text, allowing models to follow surface cues rather than discover reasoning chains autonomously. Second, evaluation is typically reduced to a single pass rate, which collapses diverse behaviours into one score and obscures whether failures stem from inadequate search, poor knowledge use, or inappropriate refusal. To address these issues, we present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox that ensures full traceability of model actions, and a holistic evaluation framework that separates search sufficiency, knowledge utilisation, and refusal behaviour. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures: models struggle with knowledge utilisation despite having sufficient evidence and demonstrate near-absent appropriate refusal when evidence is lacking. These patterns expose a fundamental gap: today's systems excel at executing given reasoning paths but fail when required to discover them. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies, incorporating verification loops and systematic evidence tracking that improve both search and synthesis capabilities. This baseline demonstrates that WebDetective's diagnostic framework can guide concrete architectural improvements, establishing our benchmark as a critical tool for developing genuinely autonomous reasoning systems rather than pattern-following agents.
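The factorised evaluation the abstract describes, separating search sufficiency, knowledge utilisation, and refusal behaviour, might be operationalised roughly as follows. This is a hypothetical sketch: the record fields and scoring rules are illustrative assumptions, not the paper's actual metric definitions.

```python
# Illustrative sketch of factorised multi-hop metrics. Field names
# ("sufficient", "correct", "refused") are assumptions for the sketch,
# not the benchmark's real schema.

def factorised_metrics(records):
    """Split an aggregate pass rate into three diagnostic scores.

    Each record is a dict with three booleans:
      sufficient -- the search trace surfaced all required evidence
      correct    -- the final answer matched the gold answer
      refused    -- the model declined to answer
    """
    sufficient = [r for r in records if r["sufficient"]]
    insufficient = [r for r in records if not r["sufficient"]]
    return {
        # Search sufficiency: how often search gathers enough evidence.
        "search_sufficiency": len(sufficient) / len(records),
        # Knowledge utilisation: correctness given sufficient evidence.
        "knowledge_utilisation": (
            sum(r["correct"] for r in sufficient) / len(sufficient)
            if sufficient else 0.0
        ),
        # Appropriate refusal: refusing when evidence was insufficient.
        "appropriate_refusal": (
            sum(r["refused"] for r in insufficient) / len(insufficient)
            if insufficient else 0.0
        ),
    }
```

Conditioning the second and third scores on evidence sufficiency is what lets such a framework distinguish a model that searches poorly from one that searches well but synthesises or refuses badly, which a single pass rate cannot do.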
Problem

Research questions and friction points this paper is trying to address.

Evaluating multi-hop reasoning without leaked hints
Separating search quality from knowledge utilization
Addressing systematic weaknesses in autonomous reasoning chains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hint-free multi-hop questions for autonomous reasoning
Factorised metrics separating search and knowledge use
Agentic workflow with verification loops for evidence tracking
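An EvidenceLoop-style agentic workflow, iterating search, evidence tracking, and verification, and refusing when no verified chain is found, could be sketched as below. The corpus, the keyword retrieval, and the verification rule are all toy stand-ins for illustration, not the paper's implementation.

```python
# Toy sketch of an iterative evidence-tracking loop. TOY_CORPUS, the
# title-match retrieval, and the substring verification check are all
# hypothetical simplifications.

TOY_CORPUS = {
    "Marie Curie": "Marie Curie was born in Warsaw.",
    "Warsaw": "Warsaw is the capital of Poland.",
}

def search(query, visited):
    """Toy retrieval: return the first unvisited page whose title
    appears in the query text."""
    for title, text in TOY_CORPUS.items():
        if title not in visited and title.lower() in query.lower():
            return title, text
    return None

def evidence_loop(question, target_term, max_hops=3):
    """Hop through the corpus, tracking evidence, until verification
    succeeds or the budget runs out (in which case the agent refuses)."""
    evidence = []      # systematic evidence tracking across hops
    visited = set()
    query = question
    for _ in range(max_hops):
        hit = search(query, visited)
        if hit is None:
            break
        title, text = hit
        visited.add(title)
        evidence.append((title, text))
        # Verification step: does the newest evidence reach the target?
        if target_term.lower() in text.lower():
            return {"answer": target_term, "evidence": evidence}
        query = text   # next hop is grounded in the retrieved evidence
    # Principled refusal: no verified chain within the search budget.
    return {"answer": None, "evidence": evidence}
```

For example, `evidence_loop("In which country was Marie Curie born?", "Poland")` needs two hops (Marie Curie, then Warsaw) before verification succeeds, while a question the corpus cannot support ends in refusal rather than a guess.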
Authors
Maojia Song
University of Leeds
Adaptive Intelligence, Natural Language Processing, Multimodal Interaction, Question Answering
Renhang Liu
Singapore University of Technology and Design (SUTD)
Xinyu Wang
Tongyi Lab, Alibaba Group
Yong Jiang
Tongyi Lab, Alibaba Group
Pengjun Xie
Alibaba Group
NLP/IR/ML
Fei Huang
Tongyi Lab, Alibaba Group
Soujanya Poria
Nanyang Technological University (NTU)
Jingren Zhou
Alibaba Group, Microsoft
Cloud Computing, Large Scale Distributed Systems, Machine Learning, Query Processing