REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites

📅 2025-04-15
🤖 AI Summary
This study addresses the challenge of evaluating and improving autonomous agents on multi-step, real-world web interaction tasks requiring precise information retrieval and state modification. To this end, we introduce REAL, the first high-fidelity, deterministic web simulation benchmark—comprising 11 mainstream websites and 112 fine-grained, multi-step tasks—enabling safe, reproducible, non-intrusive black-box evaluation. We propose a novel evaluation framework integrating programmatic state verification with LLM-guided scoring, ensuring fair cross-model and cross-architecture comparison. Empirical results show that state-of-the-art large language models achieve a maximum success rate of only 41% on REAL, exposing fundamental deficiencies in navigation robustness and task reliability. The framework is open-sourced to facilitate task extension and large-scale training data generation.

📝 Abstract
We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites. REAL comprises high-fidelity, deterministic replicas of 11 widely-used websites across domains such as e-commerce, travel, communication, and professional networking. We also release a benchmark consisting of 112 practical tasks that mirror everyday complex user interactions requiring both accurate information retrieval and state-changing actions. All interactions occur within this fully controlled setting, eliminating safety risks and enabling robust, reproducible evaluation of agent capability and reliability. Our novel evaluation framework combines programmatic checks of website state for action-based tasks with rubric-guided LLM-based judgments for information retrieval. The framework supports both open-source and proprietary agent systems through a flexible evaluation harness that accommodates black-box commands within browser environments, allowing research labs to test agentic systems without modification. Our empirical results show that frontier language models achieve at most a 41% success rate on REAL, highlighting critical gaps in autonomous web navigation and task completion capabilities. Our framework supports easy integration of new tasks, reproducible evaluation, and scalable data generation for training web agents. The websites, framework, and leaderboard are available at https://realevals.xyz and https://github.com/agi-inc/REAL.
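The abstract describes a hybrid scoring scheme: programmatic checks of the simulated website's final state for action-based tasks, and rubric-guided LLM judgments for information-retrieval tasks. A minimal sketch of that two-path scoring logic might look like the following (all names, fields, and the toy judge are hypothetical illustrations, not the actual REAL harness, which is available at https://github.com/agi-inc/REAL):

```python
"""Sketch of a REAL-style hybrid task scorer (hypothetical names throughout)."""
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    task_id: str
    kind: str  # "action" (state-changing) or "retrieval" (information lookup)
    # For action tasks: a programmatic check over the simulated site's state.
    state_check: Optional[Callable[[dict], bool]] = None
    # For retrieval tasks: a rubric handed to an LLM judge.
    rubric: str = ""


def llm_judge(answer: str, rubric: str) -> bool:
    """Stand-in for a rubric-guided LLM judgment call.

    A real implementation would prompt a judge model with the rubric and the
    agent's answer; here we use a trivial substring check as a placeholder.
    """
    return rubric.lower() in answer.lower()


def score_task(task: Task, final_state: dict, agent_answer: str) -> bool:
    """Route each task to the appropriate evaluation path."""
    if task.kind == "action":
        assert task.state_check is not None, "action tasks need a state check"
        return task.state_check(final_state)
    return llm_judge(agent_answer, task.rubric)


# Example: an action task verified deterministically against site state.
book_flight = Task(
    task_id="travel-001",
    kind="action",
    state_check=lambda state: state.get("bookings", 0) == 1,
)
print(score_task(book_flight, {"bookings": 1}, ""))  # True
```

Because the websites are deterministic replicas, the `state_check` path can be exact and fully reproducible, while the LLM-judge path handles free-form answers that resist programmatic verification.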
Problem

Research questions and friction points this paper is trying to address.

Benchmarking autonomous agents on real-world website simulations
Evaluating multi-turn agent interactions for complex user tasks
Assessing agent capabilities in web navigation and task completion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deterministic simulations of real-world websites
Programmatic checks and LLM-based judgments
Flexible evaluation harness for agent systems
👥 Authors
Divyansh Garg
The AGI Company
Shaun VanWeelden
Mercor
Diego Caples
The AGI Company
Andis Draguns
Contramont Research
Nikil Ravi
Stanford University
Pranav Putta
Plato
Naman Garg
The AGI Company
Tomas Abraham
Independent
Michael Lara
Independent
Federico Lopez
Kumo AI
Geometric Deep Learning, Graphs
James Liu
The AGI Company
Atharva Gundawar
Student of Artificial Intelligence, Arizona State University
Artificial Intelligence
Prannay Hebbar
The AGI Company
Youngchul Joo
The AGI Company
Charles London
DPhil Student in CS, University of Oxford
machine learning, learning theory, deep learning, statistics
Christian Schroeder de Witt
University of Oxford
Multi-agent Learning, Security, Safety
Sumeet Motwani
University of Oxford
Machine Learning