🤖 AI Summary
Existing computer-use agents rely on platform-specific interfaces, which hinders cross-environment deployment. To address this, we propose Surfer 2, the first general-purpose, vision-driven cross-platform agent. Our method combines hierarchical context management, decoupled planning and execution, self-verification feedback, and a multi-attempt decision mechanism, enabling end-to-end cross-platform adaptation without task-specific fine-tuning. The agent supports web (WebVoyager, WebArena), desktop (OSWorld), and mobile (AndroidWorld) environments through a single architecture. It achieves state-of-the-art accuracy of 97.1% on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld; with multiple attempts, it exceeds human performance on all four benchmarks. This work is the first to reach state-of-the-art results across all three platforms with one system, and it supports robust long-horizon task execution with autonomous recovery, a significant advance in universal, vision-based agent design.
📝 Abstract
Building agents that generalize across web, desktop, and mobile environments remains an open challenge, as prior systems rely on environment-specific interfaces that limit cross-platform deployment. We introduce Surfer 2, a unified architecture operating purely from visual observations that achieves state-of-the-art performance across all three environments. Surfer 2 integrates hierarchical context management, decoupled planning and execution, and self-verification with adaptive recovery, enabling reliable operation over long task horizons. Our system achieves 97.1% accuracy on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld, outperforming all prior systems without task-specific fine-tuning. With multiple attempts, Surfer 2 exceeds human performance on all benchmarks. These results demonstrate that systematic orchestration amplifies foundation model capabilities and enables general-purpose computer control through visual interaction alone, while calling for a next-generation vision language model to achieve Pareto-optimal cost-efficiency.