🤖 AI Summary
This work addresses the challenge of building low-cost, high-performance web agents. We propose Surfer-H, a novel agent framework, and Holo1, a family of open-weight vision-language models (VLMs) designed for web UI understanding and navigation. Holo1 is trained on curated open-access web content, synthetic examples, and self-produced agentic data, and specializes in precise UI element localization. Powered by Holo1, Surfer-H achieves a state-of-the-art 92.2% accuracy on WebVoyager while striking a Pareto-optimal trade-off between accuracy and inference cost; Holo1 also significantly outperforms prior methods on our newly introduced web interaction benchmark, WebClick, and on established general UI benchmarks. To our knowledge, this is the first open-weight VLM family dedicated to web UI perception. We fully release Holo1's model weights and the WebClick dataset, providing an efficient, reproducible, end-to-end baseline for automated web task execution.
📝 Abstract
We present Surfer-H, a cost-efficient web agent that uses Vision-Language Models (VLMs) to perform user-defined tasks on the web. We pair it with Holo1, a new open-weight collection of VLMs specialized in web navigation and information extraction. Holo1 was trained on carefully curated data sources, including open-access web content, synthetic examples, and self-produced agentic data. Holo1 tops generalist User Interface (UI) benchmarks as well as our new web UI localization benchmark, WebClick. When powered by Holo1, Surfer-H achieves state-of-the-art performance of 92.2% on WebVoyager, striking a Pareto-optimal balance between accuracy and cost. To accelerate research on agentic systems, we are open-sourcing both the WebClick evaluation dataset and the Holo1 model weights.
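To make the agent's structure concrete, the loop below is a minimal, hypothetical sketch of a Surfer-H-style control flow: a policy model (a stub standing in for a Holo1 VLM) inspects a screenshot and the task, then emits either a UI action with pixel coordinates (localization) or a final answer. All names, the `Action` format, and the stubbed policy are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Action:
    kind: str                      # "click", "type", or "answer"
    target: Optional[Tuple[int, int]]  # (x, y) pixel coordinates for UI actions
    payload: Optional[str]         # text to type, or the final answer

def stub_policy(task: str, screenshot: bytes, step: int) -> Action:
    # A real system would query the VLM here; this stub scripts two steps
    # purely to illustrate the observe -> localize -> act -> answer cycle.
    if step == 0:
        return Action("click", (640, 360), None)  # localized UI element
    return Action("answer", None, "done")

def run_agent(task: str, max_steps: int = 10) -> str:
    screenshot = b""  # placeholder; a real agent captures browser screenshots
    for step in range(max_steps):
        action = stub_policy(task, screenshot, step)
        if action.kind == "answer":
            return action.payload
        # here the click/type would be applied to the browser and a fresh
        # screenshot captured before the next policy call
    return "max steps reached"

print(run_agent("find the price"))  # → done
```

The key design point this sketch illustrates is that the VLM is queried once per step with the current screenshot, so localization accuracy (the skill WebClick measures) directly bounds end-to-end task success.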