Surfer 2: The Next Generation of Cross-Platform Computer Use Agents

📅 2025-10-22

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Existing computer-use agents rely on platform-specific interfaces, which hinders deployment across environments. This paper introduces Surfer 2, a general-purpose agent that operates purely from visual observations. It combines hierarchical context management, decoupled planning and execution, and self-verification with adaptive recovery, and it requires no task-specific fine-tuning. A single architecture covers web (WebVoyager, WebArena), desktop (OSWorld), and mobile (AndroidWorld) environments. Reported accuracy is 97.1% on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld, which the authors report as outperforming all prior systems on these four benchmarks. With multiple attempts, the system exceeds human performance on all of them. The authors conclude that systematic orchestration amplifies foundation model capabilities, and that a next-generation vision language model is still needed for Pareto-optimal cost-efficiency.


📝 Abstract

Building agents that generalize across web, desktop, and mobile environments remains an open challenge, as prior systems rely on environment-specific interfaces that limit cross-platform deployment. We introduce Surfer 2, a unified architecture operating purely from visual observations that achieves state-of-the-art performance across all three environments. Surfer 2 integrates hierarchical context management, decoupled planning and execution, and self-verification with adaptive recovery, enabling reliable operation over long task horizons. Our system achieves 97.1% accuracy on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld, outperforming all prior systems without task-specific fine-tuning. With multiple attempts, Surfer 2 exceeds human performance on all benchmarks. These results demonstrate that systematic orchestration amplifies foundation model capabilities and enables general-purpose computer control through visual interaction alone, while calling for a next-generation vision language model to achieve Pareto-optimal cost-efficiency.
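The abstract does not say how results "with multiple attempts" are scored. One common convention, assumed here purely for illustration, is that a task counts as solved if any of its first k attempts succeeds. The function name `success_rate_at_k` and the toy outcomes below are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def success_rate_at_k(outcomes: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k attempts.

    `outcomes` maps a task id to the ordered pass/fail results of its
    attempts. This scoring rule is an assumption, not the paper's stated
    protocol.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not outcomes:
        raise ValueError("outcomes must contain at least one task")
    solved = sum(any(attempts[:k]) for attempts in outcomes.values())
    return solved / len(outcomes)


# Toy data for illustration only (not benchmark results).
toy_outcomes = {
    "task_a": [True, True, True],
    "task_b": [False, True, False],
    "task_c": [False, False, True],
    "task_d": [False, False, False],
}
single_attempt = success_rate_at_k(toy_outcomes, 1)
three_attempts = success_rate_at_k(toy_outcomes, 3)
```

Under this convention the score can only stay flat or rise as k grows, which is why single-attempt and multi-attempt numbers should be compared against human baselines separately.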

Problem

Research questions and friction points this paper is trying to address.

Building agents that generalize across web, desktop, and mobile platforms

Overcoming environment-specific interface limitations for cross-platform deployment

Achieving general-purpose computer control through visual interaction alone

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified visual architecture for cross-platform control

Hierarchical context management with decoupled planning

Self-verification with adaptive recovery mechanisms
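The three contributions above can be read as one control loop. The sketch below is a minimal illustration of that reading, not the authors' implementation. It assumes a planner that turns a goal into subgoals, an executor that acts on one subgoal at a time, and a verifier that checks each result. All names (`run_task`, `plan`, `act`, `verify`, `max_replans`, `window`) are hypothetical. Long-lived planner notes are kept apart from a short sliding window of recent observations, which is one simple way to approximate hierarchical context management.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def run_task(
    goal: str,
    plan: Callable[[str, List[str]], List[str]],
    act: Callable[[str, List[str]], str],
    verify: Callable[[str, str], bool],
    max_replans: int = 2,
    window: int = 5,
) -> Tuple[bool, List[str]]:
    """Plan, execute, and verify, replanning when a step fails verification."""
    notes: List[str] = []                       # planner-level context, kept for the whole task
    recent: Deque[str] = deque(maxlen=window)   # executor-level context, bounded in size

    for _ in range(max_replans + 1):
        # The planner sees only the goal and the compact notes, not raw observations.
        for subgoal in plan(goal, notes):
            observation = act(subgoal, list(recent))
            recent.append(observation)
            if not verify(subgoal, observation):
                # Record the failure so the next plan can route around it.
                notes.append(f"failed: {subgoal}")
                break
            notes.append(f"done: {subgoal}")
        else:
            # Every subgoal in this plan passed verification.
            return True, notes
    return False, notes
```

In a real system `plan`, `act`, and `verify` would wrap model calls and screenshot capture. Here they are plain callables so the loop can be exercised with stubs. Keeping the planner's input limited to short notes is a design choice that stops long tasks from flooding the planner with raw observations.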

🔎 Similar Papers

GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

2024-06-12 · arXiv.org · Citations: 47
