The BrowserGym Ecosystem for Web Agent Research

📅 2024-12-06
🏛️ arXiv.org
🤖 AI Summary
Current web agent evaluation suffers from fragmented benchmarks, inconsistent evaluation criteria, and non-reproducible results, which hinders cross-model comparison and progress assessment. To address this, the authors present a unified evaluation ecosystem for web agent research: an extended BrowserGym framework that integrates heterogeneous benchmarks, paired with the modular AgentLab platform for flexible task expansion and end-to-end automated evaluation. The ecosystem enforces standardized observation and action spaces, LLM-driven web interaction interfaces, and reproducible evaluation pipelines. A large-scale comparative study of six state-of-the-art LLMs across six diverse benchmarks shows Claude-3.5-Sonnet achieving the highest overall performance, with GPT-4o superior on vision-intensive tasks, and identifies robustness, particularly under dynamic, noisy, or partially observable web environments, as the critical bottleneck limiting current agents.

📝 Abstract
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs). Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. In earlier work, Drouin et al. (2024) introduced BrowserGym, which aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature and includes AgentLab, a complementary framework that aids in agent creation, testing, and analysis. Our proposed ecosystem offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks made available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI's and Anthropic's latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.
Problem

Research questions and friction points this paper is trying to address.

Fragmentation and inconsistent evaluation methodologies in web agent benchmarks.
Need for a unified environment to standardize web agent evaluation.
Challenges in building robust web agents due to web complexity.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified gym-like environment for web agents
AgentLab framework for agent creation and testing
Large-scale multi-benchmark web agent experiments
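The gym-like contract the innovations above describe (reset an environment, receive an observation, emit a textual action, repeat until the episode ends) can be sketched in a few lines. Note that `StubWebEnv` and `run_episode` below are hypothetical stand-ins for illustration only, not BrowserGym's actual API; real BrowserGym environments drive a live browser and expose much richer observations (DOM, accessibility tree, screenshots).

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a BrowserGym-style environment: it only
# illustrates the reset()/step() interaction contract, not the real API.
@dataclass
class StubWebEnv:
    goal: str = "click the 'Submit' button"
    _done: bool = field(default=False, init=False)

    def reset(self):
        self._done = False
        # The observation bundles the task goal with a page snapshot
        # (real environments expose DOM, AXTree, and screenshots).
        obs = {"goal": self.goal, "axtree": "button 'Submit' [id=12]"}
        return obs, {}

    def step(self, action: str):
        # A single textual action, e.g. "click('12')", mirrors the
        # string-based action space of gym-like web environments.
        reward = 1.0 if action == "click('12')" else 0.0
        self._done = True  # single-step episode for this toy example
        return {"goal": self.goal, "axtree": ""}, reward, self._done, False, {}

def run_episode(env, agent):
    """Standard gym-style loop: observe, act, accumulate reward until done."""
    obs, info = env.reset()
    total, done = 0.0, False
    while not done:
        action = agent(obs)  # in practice, an LLM maps observation -> action
        obs, reward, done, truncated, info = env.step(action)
        total += reward
    return total

# A trivial scripted policy standing in for an LLM-driven agent.
score = run_episode(StubWebEnv(), lambda obs: "click('12')")
```

The same `run_episode` loop works for any environment honoring this contract, which is what lets one harness evaluate many benchmarks uniformly.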
👥 Authors

Thibault Le Sellier De Chezelles
ServiceNow Research, Mila, Polytechnique Montréal

Maxime Gasse
ServiceNow Research, Mila, Polytechnique Montréal

Alexandre Lacoste
Staff Research Scientist, ServiceNow Research
Machine Learning

Alexandre Drouin
Head of Frontier AI Research, ServiceNow Research; Adjunct Professor, ULaval, Mila
Machine Learning, Deep Learning, Causal Inference, Computational Biology

Massimo Caccia
ServiceNow Research

Léo Boisvert
PhD student, Polytechnique Montréal, MILA
LLM-based Agents, AI Agent Security

Megh Thakkar
MILA - Quebec AI Institute
Natural Language Processing, Deep Learning

Tom Marty
Mila: Montreal Institute of Learning Algorithms
Generative Modelling, OOD Robustness, Meta-learning, Deep Learning

Rim Assouel
Mila, Université de Montréal
Deep Learning

Sahar Omidi Shayegan
ServiceNow Research, McGill University

Lawrence Jang
Carnegie Mellon University

Xing Han Lù
PhD Student at McGill University; Mila
Natural Language Processing, Machine Learning

Ori Yoran
Tel-Aviv University

Dehan Kong
iMean AI

Frank F. Xu
Carnegie Mellon University

Siva Reddy
McGill University, Mila Quebec AI Institute
Natural Language Processing, Computational Linguistics, Deep Learning, Semantics

Quentin Cappart
Associate Professor at Polytechnique Montreal
Artificial Intelligence, Combinatorial Optimization, Constraint Programming, Reinforcement Learning

Graham Neubig
Carnegie Mellon University, All Hands AI
Natural Language Processing, Machine Learning, Artificial Intelligence

Ruslan Salakhutdinov
UPMC Professor, Machine Learning Department, CMU
Machine Learning, Artificial Intelligence, Deep Learning

Nicolas Chapados
ServiceNow Research, Mila, Polytechnique Montréal (adjunct)
Deep Learning, Artificial Intelligence, Statistics, Forecasting