🤖 AI Summary
This work addresses the challenge of building low-cost, high-performance web agents. We propose Surfer-H, a novel agent framework, and Holo1, a family of open-weight vision-language models (VLMs) designed for web UI understanding and navigation. Holo1 is trained on curated open-access web content, synthetic examples, and self-produced agentic data, and specializes in precise UI element localization. Powered by Holo1, Surfer-H achieves a state-of-the-art 92.2% accuracy on WebVoyager while striking a Pareto-optimal trade-off between accuracy and inference cost; Holo1 also significantly outperforms prior methods on our newly introduced web interaction benchmark, WebClick, and on established general UI benchmarks. To our knowledge, this is the first open-weight VLM family dedicated to web UI perception. We fully release Holo1's model weights and the WebClick dataset, providing an efficient, reproducible, end-to-end baseline for automated web task execution.
📝 Abstract
We present Surfer-H, a cost-efficient web agent that uses Vision-Language Models (VLMs) to perform user-defined tasks on the web. We pair it with Holo1, a new open-weight collection of VLMs specialized in web navigation and information extraction. Holo1 was trained on carefully curated data sources, including open-access web content, synthetic examples, and self-produced agentic data. Holo1 tops generalist User Interface (UI) benchmarks as well as our new web UI localization benchmark, WebClick. When powered by Holo1, Surfer-H achieves state-of-the-art performance of 92.2% on WebVoyager, striking a Pareto-optimal balance between accuracy and cost. To accelerate research on agentic systems, we are open-sourcing both the WebClick evaluation dataset and the Holo1 model weights.
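To make the agent's structure concrete, the loop below is a minimal, hypothetical sketch of a Surfer-H-style control flow: a policy model (a stub standing in for a Holo1 VLM) inspects a screenshot and the task, then emits either a UI action with pixel coordinates (localization) or a final answer. All names, the `Action` format, and the stubbed policy are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Action:
    kind: str                      # "click", "type", or "answer"
    target: Optional[Tuple[int, int]]  # (x, y) pixel coordinates for UI actions
    payload: Optional[str]         # text to type, or the final answer

def stub_policy(task: str, screenshot: bytes, step: int) -> Action:
    # A real system would query the VLM here; this stub scripts two steps
    # purely to illustrate the observe -> localize -> act -> answer cycle.
    if step == 0:
        return Action("click", (640, 360), None)  # localized UI element
    return Action("answer", None, "done")

def run_agent(task: str, max_steps: int = 10) -> str:
    screenshot = b""  # placeholder; a real agent captures browser screenshots
    for step in range(max_steps):
        action = stub_policy(task, screenshot, step)
        if action.kind == "answer":
            return action.payload
        # here the click/type would be applied to the browser and a fresh
        # screenshot captured before the next policy call
    return "max steps reached"

print(run_agent("find the price"))  # → done
```

The key design point this sketch illustrates is that the VLM is queried once per step with the current screenshot, so localization accuracy (the skill WebClick measures) directly bounds end-to-end task success.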