🤖 AI Summary
Problem: Large language models (LLMs) struggle to deeply comprehend and integrate information scattered across web pages.
Method: This paper introduces WebWalker, a novel explore-critic dual-role multi-agent framework that enables human-like multi-hop web navigation. It incorporates RAG-enhanced vertical and horizontal cross-page retrieval, dynamic URL path planning, and content verification mechanisms. The paper also proposes WebWalkerQA, the first benchmark explicitly designed for evaluating structured web traversal.
Contribution/Results: Experiments demonstrate that WebWalker significantly improves complex question-answering accuracy on WebWalkerQA, achieving a 37.2% absolute gain over baseline methods on multi-hop reasoning tasks. To our knowledge, this is the first systematic validation of LLMs’ capability to acquire deep, contextualized web information through active, goal-directed navigation, demonstrating both effectiveness and scalability in real-world web interaction scenarios.
📝 Abstract
Retrieval-augmented generation (RAG) demonstrates remarkable performance across open-domain question-answering tasks. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this limitation, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to systematically traverse a website's subpages and extract high-quality data. We also propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of combining RAG with WebWalker through horizontal and vertical integration in real-world scenarios.
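To make the explore-critic idea concrete, here is a minimal sketch of what such a loop could look like. It is not the paper's implementation. The in-memory `SITE` dictionary, the function names `explorer_step`, `critic_step`, and `walk`, the breadth-first link ordering, and the keyword check that stands in for an LLM judgment are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (page text, outgoing links).
SITE = {
    "/": ("Welcome to the conference site.", ["/program", "/venue"]),
    "/program": ("Program overview. See the workshops page.", ["/program/workshops"]),
    "/program/workshops": ("Workshop submission deadline: May 20.", []),
    "/venue": ("The venue is the city convention center.", []),
}


def explorer_step(url):
    """Explorer role (assumed): visit a page, return its text and its links."""
    text, links = SITE[url]
    return text, links


def critic_step(question_keywords, memory):
    """Critic role (assumed): check whether any stored observation answers
    the question. A keyword match stands in for an LLM judgment."""
    for url, text in memory:
        if all(k.lower() in text.lower() for k in question_keywords):
            return url, text
    return None


def walk(start, question_keywords, max_steps=10):
    """Alternate explorer and critic steps until the critic accepts an
    observation, the site is exhausted, or the step budget runs out."""
    queue = deque([start])  # breadth-first order is an assumption
    visited = set()
    memory = []  # (url, text) observations gathered so far
    steps = 0
    while queue and steps < max_steps:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        steps += 1
        text, links = explorer_step(url)
        memory.append((url, text))
        verdict = critic_step(question_keywords, memory)
        if verdict is not None:
            return verdict, [u for u, _ in memory]
        queue.extend(link for link in links if link not in visited)
    return None, [u for u, _ in memory]


answer, path = walk("/", ["workshop", "deadline"])
print(answer)
print(path)
```

In this toy site the answer sits two clicks below the root page, so a single retrieval of the landing page would miss it. The walker reaches it only by following links into subpages, which is the kind of traversal WebWalkerQA is described as evaluating.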