🤖 AI Summary
Existing open-source web agents show limited long-horizon reasoning on complex, multi-step retrieval tasks, primarily because challenging, high-quality information-seeking data is scarce. This paper introduces WebExplorer, a framework combining model-based exploration with iterative long-to-short query evolution that automatically synthesizes difficult query-answer pairs requiring multi-step reasoning and complex web navigation. The resulting agent, WebExplorer-8B, is trained via supervised fine-tuning followed by reinforcement learning, supports a 128K context window and up to 100 tool-calling turns, and after RL searches over an average of 16 turns per task. It surpasses WebSailor-72B on BrowseComp-en/zh, attains the best performance among models up to 100B parameters on WebWalkerQA and FRAMES, and generalizes to the HLE benchmark despite being trained only on knowledge-intensive QA data, setting state-of-the-art results at its parameter scale.
📝 Abstract
The paradigm of Large Language Models (LLMs) has increasingly shifted toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online sources. However, existing open-source web agents either demonstrate limited information-seeking abilities on complex tasks or lack transparent implementations. In this work, we identify that the key challenge lies in the scarcity of challenging data for information seeking. To address this limitation, we introduce WebExplorer: a systematic data generation approach using model-based exploration and iterative, long-to-short query evolution. This method creates challenging query-answer pairs that require multi-step reasoning and complex web navigation. By leveraging our curated high-quality dataset, we develop the advanced web agent WebExplorer-8B through supervised fine-tuning followed by reinforcement learning. Our model supports a 128K context length and up to 100 tool-calling turns, enabling long-horizon problem solving. Across diverse information-seeking benchmarks, WebExplorer-8B achieves state-of-the-art performance at its scale. Notably, as an 8B-sized model, WebExplorer-8B is able to search effectively over an average of 16 turns after RL training, achieving higher accuracy than WebSailor-72B on BrowseComp-en/zh and attaining the best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Beyond these information-seeking tasks, our model also achieves strong generalization on the HLE benchmark even though it is trained only on knowledge-intensive QA data. These results highlight our approach as a practical path toward long-horizon web agents.