🤖 AI Summary
Existing LLM-based web browsing agent benchmarks (e.g., WebArena, WebVoyager, REAL) evaluate agents under idealized conditions—stable connectivity, static pages, and benign environments—ignoring real-world non-deterministic perturbations such as network fluctuations, HTTPS interruptions, server errors, XSS attacks, and dynamic DOM mutations. Method: WAREX introduces the first systematic robustness evaluation framework tailored to real-world web complexity. It injects realistic client-side exceptions, server-side failures, malicious scripts, and structural perturbations into high-fidelity simulated environments to stress-test state-of-the-art browsing agents. Contribution/Results: Experiments reveal substantial drops in task success rates across leading agents, exposing critical reliability and security vulnerabilities. WAREX bridges the gap between idealized benchmarks and practical deployment by providing a reproducible, multi-dimensional evaluation standard—enabling rigorous assessment of robustness, resilience, and safety for next-generation web-integrated AI agents.
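To make the perturbation classes concrete, here is a minimal self-contained sketch of the kind of fault injection the summary describes. All names and probabilities are illustrative assumptions, not WAREX's actual API: a wrapper around a page-fetch callable that probabilistically injects client-side network failures, server-side errors, and structural DOM mutations such as unexpected pop-ups.

```python
import random

class NetworkError(Exception):
    """Simulated client-side connectivity failure."""

# Hypothetical fault injector (not WAREX's real interface): wraps a
# deterministic fetch function and perturbs its behavior at random.
class FaultInjector:
    def __init__(self, p_network=0.1, p_server=0.1, p_dom=0.1, seed=None):
        self.p_network = p_network  # prob. of a dropped connection
        self.p_server = p_server    # prob. of a 5xx server error
        self.p_dom = p_dom          # prob. of a DOM mutation (pop-up)
        self.rng = random.Random(seed)

    def fetch(self, fetch_fn, url):
        # Client-side perturbation: the connection fails entirely.
        if self.rng.random() < self.p_network:
            raise NetworkError(f"connection reset while fetching {url}")
        status, html = fetch_fn(url)
        # Server-side perturbation: response replaced by a 503 page.
        if self.rng.random() < self.p_server:
            return 503, "<html><body>Service Unavailable</body></html>"
        # Structural perturbation: inject an unexpected pop-up element
        # into the page, simulating a benign or malicious site change.
        if self.rng.random() < self.p_dom:
            html = html.replace(
                "</body>",
                "<div class='popup'>Subscribe now!</div></body>")
        return status, html

# Usage: an otherwise stable page, perturbed before the agent sees it.
def stable_fetch(url):
    return 200, "<html><body><h1>Checkout</h1></body></html>"

injector = FaultInjector(p_network=0.0, p_server=0.0, p_dom=1.0, seed=0)
status, html = injector.fetch(stable_fetch, "https://shop.example/checkout")
```

An agent benchmarked against such a wrapper must recover from failed loads and ignore injected elements, which is the robustness gap the summary says idealized benchmarks leave unmeasured.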
📝 Abstract
Recent advances in browser-based LLM agents have shown promise for automating tasks ranging from simple form filling to hotel booking and online shopping. Current benchmarks measure agent performance in controlled environments, such as containers or stable networks, where websites behave deterministically. In the real world, however, users access websites over networks and HTTPS connections that introduce instability from multiple sources: client-side issues, server-side errors, or broader system failures. Moreover, live websites are prone to web attacks such as Cross-Site Scripting, as well as general site modifications that can cause unexpected or malicious pop-ups or broken functionality. To address this gap, we present WAREX: Web Agent Reliability Evaluation on Existing Benchmarks. We measure the impact of WAREX across three popular benchmarks: WebArena, WebVoyager, and REAL. Our experiments show that introducing WAREX leads to significant drops in task success rates, highlighting the limited robustness of state-of-the-art agents.