DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking

📅 2025-10-22
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing search agents struggle to simultaneously achieve deep reasoning (e.g., multi-hop path analysis) and broad coverage (i.e., large-scale information integration), limiting their effectiveness in real-world tasks such as market analysis. To address this, we introduce DeepWideSearch, the first benchmark for evaluating search agents along both depth and breadth dimensions. It comprises 220 complex questions across 15 domains, each requiring multi-hop retrieval paths that jointly challenge reasoning depth and information scope. Our evaluation and error analysis identify four critical failure modes: inadequate reflection mechanisms, over-reliance on parametric knowledge, insufficient retrieval completeness, and poor context management. Empirical results show that state-of-the-art agents achieve only a 2.39% average success rate, underscoring the substantial challenge of integrating deep and wide search. The benchmark is publicly released to enable fine-grained assessment and advancement of information-seeking agents.

📝 Abstract
Current search agents fundamentally lack the ability to simultaneously perform deep reasoning over multi-hop retrieval and wide-scale information collection, a critical deficiency for real-world applications like comprehensive market analysis and business development. To bridge this gap, we introduce DeepWideSearch, the first benchmark explicitly designed to evaluate agents' ability to integrate depth and width in information seeking. In DeepWideSearch, agents must collect and process a large volume of data entries, each requiring deep reasoning over multi-hop retrieval paths. Specifically, we propose two methods to convert established datasets, resulting in a curated collection of 220 questions spanning 15 diverse domains. Extensive experiments demonstrate that even state-of-the-art agents achieve only a 2.39% average success rate on DeepWideSearch, highlighting the substantial challenge of integrating depth and width in information-seeking tasks. Furthermore, our error analysis reveals four failure modes: lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow, exposing key limitations in current agent architectures. We publicly release DeepWideSearch to catalyze future research on more capable and robust information-seeking agents.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking agentic information seeking depth and width integration
Evaluating multi-hop reasoning with large-scale data collection
Addressing critical deficiencies in real-world search agent applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for depth-width search integration
Converts datasets into multi-hop reasoning questions
Identifies four failure modes in agent architectures