WAREX: Web Agent Reliability Evaluation on Existing Benchmarks

📅 2025-09-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM-based web browsing agent benchmarks (e.g., WebArena, WebVoyager, REAL) evaluate agents under idealized conditions—stable connectivity, static pages, and benign environments—ignoring real-world non-deterministic perturbations such as network fluctuations, HTTPS interruptions, server errors, XSS attacks, and dynamic DOM mutations. Method: WAREX introduces the first systematic robustness evaluation framework tailored to real-world web complexity. It injects realistic client-side exceptions, server-side failures, malicious scripts, and structural perturbations into high-fidelity simulated environments to stress-test state-of-the-art browsing agents. Contribution/Results: Experiments reveal substantial drops in task success rates across leading agents, exposing critical reliability and security vulnerabilities. WAREX bridges the gap between idealized benchmarks and practical deployment by providing a reproducible, multi-dimensional evaluation standard—enabling rigorous assessment of robustness, resilience, and safety for next-generation web-integrated AI agents.
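The fault categories described above (server-side failures, injected scripts, structural DOM perturbations) can be illustrated with a minimal injector sketch. This is not WAREX's actual implementation; the class and fault names below are hypothetical, and a real harness would intercept live browser traffic (e.g., via a proxy) rather than wrap response objects directly:

```python
import random
from dataclasses import dataclass

@dataclass
class Response:
    status: int
    body: str

class PerturbationInjector:
    """Probabilistically perturbs responses to emulate real-world faults.

    Fault types loosely mirror WAREX's categories: server errors,
    XSS-style script injection, and structural DOM mutations.
    All names here are illustrative, not from the paper.
    """
    def __init__(self, p_fault: float, seed: int = 0):
        self.p_fault = p_fault
        self.rng = random.Random(seed)  # seeded for reproducible runs

    def perturb(self, resp: Response) -> Response:
        # Pass the response through unchanged with probability 1 - p_fault.
        if self.rng.random() >= self.p_fault:
            return resp
        fault = self.rng.choice(["server_error", "inject_script", "mutate_dom"])
        if fault == "server_error":
            # Simulate a backend failure the agent must recover from.
            return Response(500, "Internal Server Error")
        if fault == "inject_script":
            # Simulate an XSS-style payload appended before </body>.
            return Response(resp.status, resp.body.replace(
                "</body>", "<script>alert('injected')</script></body>"))
        # mutate_dom: prepend a distracting pop-up overlay to the page.
        return Response(resp.status, resp.body.replace(
            "<body>", "<body><div class='popup'>Subscribe now!</div>"))
```

With `p_fault=0.0` the injector is a no-op; with `p_fault=1.0` every response is perturbed, which is useful for deterministic stress tests before sweeping intermediate fault rates.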

📝 Abstract
Recent advances in browser-based LLM agents have shown promise for automating tasks ranging from simple form filling to hotel booking and online shopping. Current benchmarks measure agent performance in controlled environments, such as containers or stable networks, where websites behave deterministically. However, in the real world, users access websites over networks and HTTPS connections that introduce instability from multiple sources: client-side issues, server-side issues, or broader system failures. Moreover, live websites are prone to web attacks such as Cross-Site Scripting, as well as general site modifications, which can cause unexpected or malicious pop-ups or improper functionality. To address this gap, we present WAREX: Web Agent Reliability Evaluation on Existing Benchmarks. We measure the impact of WAREX across three popular benchmarks: WebArena, WebVoyager, and REAL. Our experiments show that introducing WAREX leads to significant drops in task success rates, highlighting the limited robustness of state-of-the-art agents.
Problem

Research questions and friction points this paper is trying to address.

Evaluating web agent reliability under real-world network instability
Assessing agent robustness against web attacks and site modifications
Testing agent performance degradation across existing benchmark environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates web agents under real-world network instability
Tests agent resilience against web attacks and modifications
Demonstrates significant task-success-rate drops when applied to existing benchmarks
Su Kara
Department of Computer Science, Stanford University
Fazle Faisal
Microsoft Research, Redmond, USA
Suman Nath
Principal Researcher, Microsoft Research
Cloud Reliability · Mobile Systems · Sensor Networks