Needle in the Web: A Benchmark for Retrieving Targeted Web Pages in the Wild

📅 2025-12-18
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing benchmarks (e.g., BrowseComp, xBench-DeepSearch) emphasize multi-hop fact retrieval but overlook fuzzy exploratory search (FES): a semantically ambiguous, goal-open information-seeking task prevalent in real-world web navigation. Method: This work formally defines and quantifies FES for the first time, introducing the first FES benchmark tailored to realistic web retrieval: it generates controllable-difficulty queries from web-based factual statements, spanning seven domains and comprising 663 high-quality questions; it further proposes an evaluation framework integrating content reconstruction, difficulty calibration, and multi-model/agent collaborative assessment. Contribution/Results: Experiments reveal that three state-of-the-art LLMs and three representative search agents achieve sub-35% average accuracy, and none demonstrate robust performance across domains or difficulty levels, exposing a fundamental limitation of current systems in handling semantic ambiguity during retrieval.
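
To make the idea of controllable-difficulty query generation concrete, here is a minimal sketch, not the paper's implementation: the `FactualStatement` fields, the pre-written vague paraphrases, and the `compose` helper are illustrative assumptions. It shows one way queries of graded ambiguity could be assembled from factual statements about a target page, with difficulty controlled by how many concrete facts are replaced with vaguer forms.

```python
# Illustrative sketch (assumed design, not the paper's pipeline):
# difficulty = number of concrete facts swapped for vaguer paraphrases.
from dataclasses import dataclass
from typing import Callable

@dataclass
class FactualStatement:
    text: str          # concrete claim extracted from the target page
    vague_form: str    # vaguer paraphrase of the same claim

def build_query(facts: list[FactualStatement],
                difficulty: int,
                compose: Callable[[list[str]], str]) -> str:
    """Blend concrete and vague forms of the page's facts into one query.

    difficulty ranges from 0 (all concrete facts, easiest) to len(facts)
    (all vague paraphrases, hardest); `compose` joins the clauses into
    a natural-language request.
    """
    difficulty = max(0, min(difficulty, len(facts)))
    clauses = [f.vague_form if i < difficulty else f.text
               for i, f in enumerate(facts)]
    return compose(clauses)

# Usage with hypothetical facts about a hypothetical target page.
facts = [
    FactualStatement("published an open benchmark for web retrieval in 2025",
                     "recently shared some kind of evaluation resource"),
    FactualStatement("covers seven topical domains",
                     "covers several different areas"),
]
query = build_query(
    facts, difficulty=1,
    compose=lambda cs: "I'm looking for a page that " + " and ".join(cs) + ".")
print(query)
```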

๐Ÿ“ Abstract
Large Language Models (LLMs) have evolved from simple chatbots into sophisticated agents capable of automating complex real-world tasks, where browsing and reasoning over live web content is key to assessing retrieval and cognitive skills. Existing benchmarks like BrowseComp and xBench-DeepSearch emphasize complex reasoning searches requiring multi-hop synthesis but neglect Fuzzy Exploratory Search, namely queries that are vague and multifaceted, where users seek the most relevant webpage rather than a single factual answer. To address this gap, we introduce Needle in the Web, a novel benchmark specifically designed to evaluate modern search agents and LLM-based systems on their ability to retrieve and reason over real-world web content in response to ambiguous, exploratory queries under varying levels of difficulty. Needle in the Web comprises 663 questions spanning seven distinct domains. To ensure high query quality and answer uniqueness, we employ a flexible methodology that reliably generates queries of controllable difficulty based on factual claims of web contents. We benchmark three leading LLMs and three agent-based search systems on Needle in the Web, finding that most models struggle: many achieve below 35% accuracy, and none consistently excel across domains or difficulty levels. These findings reveal that Needle in the Web presents a significant challenge for current search systems and highlights the open problem of effective fuzzy retrieval under semantic ambiguity.
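
As a rough illustration of the accuracy numbers reported above, the sketch below scores a search agent on a Needle-in-the-Web-style question set. The example schema (question, gold_url, domain, difficulty), the exact-URL-match criterion, and the `normalize` helper are assumptions for illustration only; the paper's evaluation framework additionally involves content reconstruction, difficulty calibration, and multi-model/agent collaborative assessment.

```python
# Assumed scoring harness, not the benchmark's official evaluator:
# an answer counts as correct if the returned URL matches the gold page.
from collections import defaultdict
from typing import Callable
from urllib.parse import urlsplit

def normalize(url: str) -> str:
    """Crude URL normalization so trivially different links still match."""
    parts = urlsplit(url.strip().lower())
    return parts.netloc.removeprefix("www.") + parts.path.rstrip("/")

def evaluate(examples: list[dict], search_agent: Callable[[str], str]) -> dict:
    """Run the agent on every query; aggregate accuracy overall, by domain, by difficulty."""
    totals, hits = defaultdict(int), defaultdict(int)
    for ex in examples:
        predicted = search_agent(ex["question"])
        correct = normalize(predicted) == normalize(ex["gold_url"])
        for key in ("overall",
                    f"domain:{ex['domain']}",
                    f"difficulty:{ex['difficulty']}"):
            totals[key] += 1
            hits[key] += int(correct)
    return {k: hits[k] / totals[k] for k in totals}

# Usage with a single made-up example and a trivial stand-in agent.
examples = [
    {"question": "Which page describes a benchmark for vague web queries?",
     "gold_url": "https://example.org/benchmark",
     "domain": "science", "difficulty": 2},
]
print(evaluate(examples, search_agent=lambda q: "https://example.org/benchmark"))
```

Reporting accuracy separately per domain and per difficulty level, as in this sketch, is what allows the claim that no system "consistently excels across domains or difficulty levels" to be checked.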
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs' ability to retrieve web pages for vague, multifaceted queries
Assesses search systems on real-world web content across varying difficulty levels
Addresses the gap in benchmarks for fuzzy exploratory search tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for ambiguous exploratory web queries
Generates queries with controllable difficulty levels
Evaluates retrieval over real-world web content