WebGames: Challenging General-Purpose Web-Browsing AI Agents

📅 2025-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of standardized evaluation for general-purpose web-browsing AI agents. We introduce WebGames, a fully client-side, dependency-free benchmark comprising 50+ interactive challenges spanning five dimensions: browser navigation, input comprehension, cognitive reasoning, workflow automation, and entertainment interaction. WebGames provides verifiable ground-truth solutions and a standardized evaluation protocol. A systematic assessment of state-of-the-art vision-language models (e.g., GPT-4o) reveals a substantial capability gap in everyday web interactions: the best-performing system achieves an average success rate of only 43.1%, versus 95.7% for human users. The benchmark employs a lightweight, sandboxed execution environment that enables rapid iteration and fair, reproducible evaluation. WebGames is fully open-sourced, providing a standard foundation for evaluating web-based autonomous agents.

📝 Abstract
We introduce WebGames, a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents through a collection of 50+ interactive challenges. These challenges are specifically crafted to be straightforward for humans while systematically testing the limitations of current AI systems across fundamental browser interactions, advanced input processing, cognitive tasks, workflow automation, and interactive entertainment. Our framework eliminates external dependencies through a hermetic testing environment, ensuring reproducible evaluation with verifiable ground-truth solutions. We evaluate leading vision-language models including GPT-4o, Claude Computer-Use, Gemini-1.5-Pro, and Qwen2-VL against human performance. Results reveal a substantial capability gap, with the best AI system achieving only 43.1% success rate compared to human performance of 95.7%, highlighting fundamental limitations in current AI systems' ability to handle common web interaction patterns that humans find intuitive. The benchmark is publicly available at webgames.convergence.ai, offering a lightweight, client-side implementation that facilitates rapid evaluation cycles. Through its modular architecture and standardized challenge specifications, WebGames provides a robust foundation for measuring progress in development of more capable web-browsing agents.
Problem

Research questions and friction points this paper is trying to address.

Evaluates AI web-browsing capabilities
Tests AI on interactive web challenges
Highlights AI-human performance gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

comprehensive benchmark suite
hermetic testing environment
standardized challenge specifications
👥 Authors

George Thomas
Convergence Labs Ltd., Clusterfudge Ltd.

Alex J. Chan
Director of Engineering, Salesforce
Machine Learning, Inverse Reinforcement Learning, Imitation Learning

Jikun Kang
LMTS at Salesforce
Machine Learning, Reinforcement Learning

Wenqi Wu
Convergence Labs Ltd.

Filippos Christianos
University of Edinburgh

Fraser Greenlee
Convergence Labs Ltd.

Andy Toulis
Convergence Labs Ltd.

Marvin Purtorab
Convergence Labs Ltd.