WideSearch: Benchmarking Agentic Broad Info-Seeking

📅 2025-08-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks inadequately assess the reliability and completeness of LLM-driven search agents on large-scale information-gathering tasks. Method: We introduce WideSearch, the first high-quality, cross-lingual, multi-domain benchmark explicitly designed to evaluate agents' wide-context information collection capabilities. It comprises 200 real-world user queries curated through a rigorous five-stage quality control pipeline. The evaluation methodology covers single- and multi-agent frameworks, requires structured output generation, and combines human verification with cross-validation to enable fine-grained, verifiable, item-level assessment. Contribution/Results: Experiments show that most state-of-the-art search agents achieve success rates near 0%, with the best reaching only about 5%, far below the near-perfect (~100%) performance of human testers. This is the first systematic demonstration of a fundamental bottleneck in autonomous agents' ability to gather and integrate information at scale across broad, heterogeneous domains.
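
To make the item-level assessment concrete, here is a minimal scoring sketch. It is an illustration, not the benchmark's released evaluator: it assumes both the agent output and the ground truth are tables keyed by a unique entity column, that every cell is an atomic value that can be checked independently, and that the normalization and matching rules shown are placeholders for whatever the benchmark actually uses.

```python
# Hypothetical item-level scoring for a WideSearch-style task (illustrative only).
# Assumption: agent output and ground truth are both tables keyed by a unique
# entity; each cell is an atomic, objectively checkable value.

from typing import Dict

Table = Dict[str, Dict[str, str]]  # entity key -> {column name -> cell value}


def normalize(value: str) -> str:
    """Lightweight normalization before comparison (placeholder rule)."""
    return " ".join(value.strip().lower().split())


def score(prediction: Table, gold: Table):
    """Return item-level precision/recall/F1 and a strict task-success flag."""
    tp = 0
    pred_items = sum(len(cols) for cols in prediction.values())
    gold_items = sum(len(cols) for cols in gold.values())

    for key, gold_cols in gold.items():
        pred_cols = prediction.get(key, {})
        for col, gold_val in gold_cols.items():
            if col in pred_cols and normalize(pred_cols[col]) == normalize(gold_val):
                tp += 1

    precision = tp / pred_items if pred_items else 0.0
    recall = tp / gold_items if gold_items else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # "Success" here means the table is complete and every cell is correct.
    success = tp == gold_items and pred_items == gold_items
    return {"precision": precision, "recall": recall, "f1": f1, "success": success}


if __name__ == "__main__":
    gold = {"Event A": {"date": "2024-03-10", "city": "Berlin"}}
    pred = {"Event A": {"date": "2024-03-10", "city": "berlin"}}
    print(score(pred, gold))  # matches after normalization -> success True
```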

📝 Abstract
From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising way to liberate humans from this tedious work. However, whether these agents can perform such "wide-context" collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) drawn from over 15 diverse domains and grounded in real user queries. Each task requires agents to collect large-scale atomic information, each item of which can be verified objectively, and to arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent and multi-agent frameworks as well as end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. Given sufficient time, however, cross-validation by multiple human testers can achieve a near-100% success rate. These results show that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent directions for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch-seed.github.io/
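
As an illustration of the task format described in the abstract, the sketch below shows what a WideSearch-style item might look like: a natural-language query plus a required table schema whose cells are atomic, independently verifiable facts. The query, field names, and values here are invented for illustration and are not drawn from the released dataset.

```python
# Invented example of a wide-context collection task (not from the actual dataset).
# The agent must fill one row per entity; each cell is independently verifiable.

task = {
    "query": (
        "List every marathon held in a given country in 2024, with its date, "
        "host city, and official finisher count."
    ),
    "required_columns": ["event_name", "date", "host_city", "finisher_count"],
    "key_column": "event_name",        # each row is identified by a unique entity
    "answer_format": "markdown_table",  # agents must return a structured table
}

# One expected row (hypothetical values), matching the required schema:
expected_row = {
    "event_name": "Example City Marathon",
    "date": "2024-03-10",
    "host_city": "Example City",
    "finisher_count": "25412",
}
```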
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM agents' reliability in large-scale information collection
Lack of benchmarks for wide-context agentic search capabilities
Assessing agent performance on diverse, verifiable info-seeking tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces WideSearch benchmark for agent evaluation
Uses 200 manually curated multi-domain questions
Benchmarks over 10 state-of-the-art agentic search systems
👥 Authors

Ryan Wong (ByteDance Seed)
Jiawei Wang (ByteDance Seed)
Junjie Zhao (Master's student, Peking University; CVML)
Li Chen (ByteDance Seed)
Yan Gao (ByteDance Seed)
Long Zhang (ByteDance Seed)
Xuan Zhou (ByteDance Seed)
Zuo Wang (ByteDance Seed)
Kai Xiang (Tech, Strategy & Investment; MIT, 2011-2017; Tsinghua, 2007-2011; Frontier Tech)
Ge Zhang (ByteDance Seed)
Wenhao Huang (ByteDance Seed)
Yang Wang (ByteDance Seed)
Ke Wang (ByteDance Seed)