Surfer 2: The Next Generation of Cross-Platform Computer Use Agents

📅 2025-10-22
🤖 AI Summary
Existing computer-use agents rely on platform-specific interfaces, which hinders cross-environment deployment. To address this, the paper introduces Surfer 2, a general-purpose agent that operates purely from visual observations across platforms. The method combines hierarchical context management, decoupled planning and execution, self-verification with adaptive recovery, and a multi-attempt decision mechanism, and it requires no task-specific fine-tuning. The same agent supports web (WebVoyager, WebArena), desktop (OSWorld), and mobile (AndroidWorld) environments, reaching 97.1% accuracy on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld, which the authors report as state of the art on each benchmark. With multiple attempts, it exceeds human performance on all four benchmarks. The results point to robust long-horizon operation and autonomous recovery in a single vision-based agent design.

📝 Abstract
Building agents that generalize across web, desktop, and mobile environments remains an open challenge, as prior systems rely on environment-specific interfaces that limit cross-platform deployment. We introduce Surfer 2, a unified architecture operating purely from visual observations that achieves state-of-the-art performance across all three environments. Surfer 2 integrates hierarchical context management, decoupled planning and execution, and self-verification with adaptive recovery, enabling reliable operation over long task horizons. Our system achieves 97.1% accuracy on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld, outperforming all prior systems without task-specific fine-tuning. With multiple attempts, Surfer 2 exceeds human performance on all benchmarks. These results demonstrate that systematic orchestration amplifies foundation model capabilities and enables general-purpose computer control through visual interaction alone, while calling for a next-generation vision language model to achieve Pareto-optimal cost-efficiency.
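
The multi-attempt claim above can be made concrete with a small sketch. It assumes, purely for illustration, that a task counts as solved when at least one of its first k attempts succeeds; the abstract does not spell out the exact protocol. The function name `success_at_k` and the toy outcomes are hypothetical and are not results from the paper.

```python
from typing import Sequence


def success_at_k(attempts: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k attempts.

    Assumption: "multiple attempts" means best-of-k per task. The paper's
    actual evaluation protocol may differ.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts:
        raise ValueError("attempts must contain at least one task")
    solved = sum(1 for task_attempts in attempts if any(task_attempts[:k]))
    return solved / len(attempts)


# Made-up outcomes for four hypothetical tasks, three attempts each.
toy_outcomes = [
    [True, True, True],
    [False, True, False],
    [False, False, True],
    [False, False, False],
]

for k in (1, 2, 3):
    print(k, success_at_k(toy_outcomes, k))
```

On this toy data the success rate can only stay flat or rise as k grows, which is why single-attempt and multi-attempt numbers should not be compared directly.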
Problem

Research questions and friction points this paper is trying to address.

Building agents that generalize across web, desktop, and mobile platforms
Overcoming environment-specific interface limitations for cross-platform deployment
Achieving general-purpose computer control through visual interaction alone
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified visual architecture for cross-platform control
Hierarchical context management with decoupled planning
Self-verification with adaptive recovery mechanisms
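
The three ideas listed above can be sketched as a single control loop. This is a minimal illustration under stated assumptions, not the Surfer 2 implementation: `run_task`, `RunResult`, and the toy planner, executor, and verifier are all hypothetical names, and in a real system each callable would wrap a model that reads screenshots.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RunResult:
    success: bool
    history: List[str] = field(default_factory=list)


def run_task(
    goal: str,
    plan: Callable[[str, List[str]], List[str]],
    execute: Callable[[str], str],
    verify: Callable[[str, List[str]], bool],
    max_replans: int = 2,
) -> RunResult:
    """Plan, execute, verify, and replan on failure (hypothetical sketch)."""
    history: List[str] = []
    for round_index in range(max_replans + 1):
        # The planner sees the goal plus a compact history (high-level context).
        subgoals = plan(goal, history)
        for subgoal in subgoals:
            # The executor sees only the current subgoal (low-level context).
            outcome = execute(subgoal)
            history.append(f"{subgoal} -> {outcome}")
        # Self-verification: check the goal before declaring success.
        if verify(goal, history):
            return RunResult(True, history)
        # Adaptive recovery: record the failure so the planner can adjust.
        history.append(f"verification failed in round {round_index}")
    return RunResult(False, history)


def toy_plan(goal: str, history: List[str]) -> List[str]:
    # After a failed verification, the toy planner proposes the missing step.
    if any("verification failed" in entry for entry in history):
        return ["enable dark mode"]
    return ["open settings"]


def toy_execute(subgoal: str) -> str:
    return "done"


def toy_verify(goal: str, history: List[str]) -> bool:
    return "enable dark mode -> done" in history


result = run_task("turn on dark mode", toy_plan, toy_execute, toy_verify)
print(result.success, result.history)
```

Keeping the planner and executor behind separate callables is what lets each one work from a different slice of context, and the verifier is the only component that can end the loop with a success.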
Authors

Mathieu Andreux, H Company
Märt Bakler, H Company
Yanael Barbier, H Company
Hamza Ben Chekroun, H Company
Emilien Biré, CentraleSupélec, Deep Learning
Antoine Bonnet, H Company
Riaz Bordie, H Company
Nathan Bout, H Company
Matthias Brunel, H Company
Aleix Cambray, H Company
Pierre-Louis Cedoz, H Company
Antoine Chassang, H Company
Gautier Cloix, H Company
Ethan Connelly, H Company
Alexandra Constantinou, H Company
Ramzi De Coster, H Company
Hubert de la Jonquiere, H Company
Aurélien Delfosse, H Company
Maxime Delpit, H Company
Alexis Deprez, H Company
Augustin Derupti, H Company
Mathieu Diaz, H Company
Shannon D'Souza, H Company
Julie Dujardin, H Company
Abai Edmund, H Company