Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of building low-cost, high-performance web agents. We propose Surfer-H, a web agent framework, and Holo1, a family of open-weight vision-language models (VLMs) designed for web UI understanding and navigation. Holo1 is trained on synthetic data and self-generated agentic data, and targets precise UI element localization. Powered by Holo1, Surfer-H achieves a state-of-the-art accuracy of 92.2% on WebVoyager at a Pareto-optimal trade-off between accuracy and inference cost. Holo1 also tops established general UI benchmarks and our newly introduced web UI localization benchmark, WebClick. To our knowledge, this is the first open-weight VLM family dedicated to web UI perception. We release Holo1's model weights and the WebClick dataset. The framework provides an efficient, reproducible, end-to-end baseline for automated web task execution.

📝 Abstract
We present Surfer-H, a cost-efficient web agent that integrates Vision-Language Models (VLM) to perform user-defined tasks on the web. We pair it with Holo1, a new open-weight collection of VLMs specialized in web navigation and information extraction. Holo1 was trained on carefully curated data sources, including open-access web content, synthetic examples, and self-produced agentic data. Holo1 tops generalist User Interface (UI) benchmarks as well as our new web UI localization benchmark, WebClick. When powered by Holo1, Surfer-H achieves a 92.2% state-of-the-art performance on WebVoyager, striking a Pareto-optimal balance between accuracy and cost-efficiency. To accelerate research advancement in agentic systems, we are open-sourcing both our WebClick evaluation dataset and the Holo1 model weights.
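The "Pareto-optimal balance between accuracy and cost-efficiency" claim can be made concrete with a minimal sketch. An agent configuration is Pareto-optimal when no other configuration is at least as accurate and at least as cheap while being strictly better on one of the two. The `AgentRun` type, the agent names, and every number below are hypothetical placeholders for illustration, not results from the paper.

```python
from typing import NamedTuple


class AgentRun(NamedTuple):
    name: str        # hypothetical agent label
    accuracy: float  # task success rate in [0, 1], higher is better
    cost: float      # cost per task in arbitrary units, lower is better


def dominates(a: AgentRun, b: AgentRun) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(runs: list[AgentRun]) -> list[AgentRun]:
    """Keep only the runs that no other run dominates."""
    return [r for r in runs if not any(dominates(o, r) for o in runs)]


# Placeholder values for illustration only; these are NOT results from the paper.
runs = [
    AgentRun("agent_a", accuracy=0.90, cost=1.0),
    AgentRun("agent_b", accuracy=0.85, cost=2.0),  # dominated by agent_a
    AgentRun("agent_c", accuracy=0.95, cost=5.0),
    AgentRun("agent_d", accuracy=0.70, cost=0.2),
]

front = pareto_front(runs)
print([r.name for r in front])
```

In this toy setup, `agent_b` drops out because `agent_a` is both more accurate and cheaper; the remaining three each win on at least one axis, so all of them sit on the front.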
Problem

Research questions and friction points this paper is trying to address.

Develop cost-efficient web agent for user tasks
Improve web navigation with specialized Vision-Language Models
Balance accuracy and cost in web UI performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Vision-Language Models for web tasks
Uses open-weight Holo1 VLMs for navigation
Achieves cost-efficiency with high accuracy
Authors
Mathieu Andreux
H Company
Breno Baldas Skuk
H Company
Hamza Benchekroun
Core Team Researcher
Reinforcement Learning, Large Language Models
Emilien Biré
H Company
Antoine Bonnet
H Company
Riaz Bordie
H Company
Matthias Brunel
H Company
Pierre-Louis Cedoz
H Company
Antoine Chassang
H Company
Mickael Chen
H Company
Generative Models
Alexandra D. Constantinou
H Company
Antoine d'Andigné
H Company
Hubert de La Jonquiere
H Company
A. Delfosse
H Company
Ludovic Denoyer
Lead Agent research at H -- Full Professor at Sorbonne Universités on Sabbatical
Machine Learning, Reinforcement Learning, Deep Learning
Alexis Deprez
H Company
Augustin Derupti
H Company
Michael Eickenberg
H Company
Mathis Federico
H Company
Charles Kantor
H Company
Xavier Koegler
H Company
Yann Labbé
H Company
Matthew C. H. Lee
H Company
Erwan Le Jumeau de Kergaradec
H Company
Amir Mahla
H Company
Avshalom Manevich
H Company
Adrien Maret
H Company
Charles Masson
H Company
Rafael Maurin
H Company
Arturo Mena
H Company
Philippe Modard
H Company
Axel Moyal
H Company
Axel Nguyen Kerbel
H Company
Julien Revelle
H Company
Mats L. Richter
H Company
María Santos
H Company
L. Sifre
H Company
Maxime Theillard
H Company
Marc Thibault
H Company
L. Thiry
H Company
Léo Tronchon
H Company
Nicolas Usunier
H Company
Tony Wu
H Company