Are Autonomous Web Agents Good Testers?

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the fragility and high maintenance cost of conventional automated test scripts by investigating whether large language model (LLM)-driven autonomous web agents (AWAs) can serve as low-maintenance Autonomous Test Agents (ATAs). The authors propose transforming AWAs into ATAs capable of interpreting natural-language instructions, executing test steps, verifying semantic assertions, and producing pass/fail verdicts. Key contributions include: (1) the first offline web testing benchmark tailored to ATAs, comprising 113 manually curated test cases; (2) two open-source ATA implementations, SeeAct-ATA and PinATA; and (3) the first systematic evaluation of ATAs' effectiveness and limitations in realistic scenarios. Experimental results show that PinATA reaches roughly 60% verdict accuracy and up to 94% specificity, a 50% improvement over SeeAct-ATA, while qualitative analysis identifies critical bottlenecks in handling dynamic content, evaluating semantic assertions, and recovering from anomalies.
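To make the execute/verify/verdict loop described above concrete, here is a minimal sketch of how an ATA might process one test case. This is not the paper's implementation: `ask_llm`, `Browser.snapshot`, and `Browser.perform` are hypothetical stand-ins for an LLM client and a browser-automation handle.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    steps: list[str]   # natural-language actions, e.g. "Add the first item to the cart"
    assertion: str     # natural-language expected outcome

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; substitute any chat-completion client."""
    raise NotImplementedError

def run_test(browser, test: TestCase) -> str:
    # 1. Execute each natural-language step by letting the LLM pick a concrete action.
    for step in test.steps:
        action = ask_llm(
            f"Page state:\n{browser.snapshot()}\n"
            f"Instruction: {step}\n"
            "Reply with one action, e.g. CLICK <selector> or TYPE <selector> <text>."
        )
        browser.perform(action)  # hypothetical dispatcher from action string to browser call

    # 2. Verify the semantic assertion against the final page state.
    answer = ask_llm(
        f"Page state:\n{browser.snapshot()}\n"
        f"Does this satisfy the assertion: '{test.assertion}'? Answer PASS or FAIL."
    )

    # 3. Produce the verdict.
    return "pass" if answer.strip().upper().startswith("PASS") else "fail"
```

The key design point the paper probes is step 2: unlike a scripted assertion on a fixed selector, the check is semantic, so the agent can tolerate cosmetic UI changes but may also misjudge subtle failures.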

📝 Abstract
Despite advances in automated testing, manual testing remains prevalent due to the high maintenance demands associated with test script fragility: scripts often break with minor changes in application structure. Recent developments in Large Language Models (LLMs) offer a potential alternative by powering Autonomous Web Agents (AWAs) that can autonomously interact with applications. These agents may serve as Autonomous Test Agents (ATAs), potentially reducing the need for maintenance-heavy automated scripts by utilising natural-language instructions similar to those used by human testers. This paper investigates the feasibility of adapting AWAs for natural-language test case execution and how to evaluate them. We contribute (1) a benchmark of three offline web applications and a suite of 113 manual test cases, split between passing and failing cases, to evaluate and compare ATAs' performance; (2) SeeAct-ATA and PinATA, two open-source ATA implementations capable of executing test steps, verifying assertions, and giving verdicts; and (3) comparative experiments using our benchmark that quantify our ATAs' effectiveness. Finally, we conduct a qualitative evaluation to identify the limitations of PinATA, our best-performing implementation. Our findings reveal that our simple implementation, SeeAct-ATA, does not perform well compared to the more advanced PinATA when executing test cases (PinATA yields a 50% performance improvement). However, while PinATA reaches around 60% correct verdicts and up to a promising 94% specificity, we identify several limitations that must be addressed to develop more resilient and reliable ATAs, paving the way for robust, low-maintenance test automation. CCS Concepts: • Software and its engineering → Software testing and debugging.
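The reported 60% accuracy and 94% specificity can be read as standard confusion-matrix metrics over verdicts. A small sketch, assuming failing test cases are treated as the positive class (the paper may define the classes differently):

```python
def accuracy_and_specificity(expected: list[str], predicted: list[str]) -> tuple[float, float]:
    """expected/predicted hold 'pass' or 'fail' verdicts for each benchmark test case.

    Treating 'fail' as the positive class, specificity is the fraction of
    genuinely passing test cases that the agent also judges as passing.
    """
    correct = sum(e == p for e, p in zip(expected, predicted))
    true_neg = sum(e == p == "pass" for e, p in zip(expected, predicted))
    actual_neg = sum(e == "pass" for e in expected)
    return correct / len(expected), true_neg / actual_neg
```

Under this reading, high specificity means the agent rarely flags a working application as broken, which matters for keeping false alarms low in practice.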
Problem

Research questions and friction points this paper is trying to address.

Investigates feasibility of Autonomous Web Agents for test automation
Evaluates performance of Autonomous Test Agents using benchmarks
Identifies limitations to improve resilience of test agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autonomous Test Agents reduce script maintenance
Natural language instructions enable human-like testing (see the example case after this list)
Benchmark and open-source implementations evaluate ATA performance
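For illustration, here is the kind of natural-language test case such a benchmark might contain. The concrete schema of the paper's 113 cases is not shown on this page, so this shape is an assumption:

```python
# Hypothetical shape for one benchmark entry; the paper's actual schema may differ.
example_case = {
    "steps": [
        "Open the product catalogue",
        "Add the cheapest item to the cart",
        "Go to the cart page",
    ],
    "assertion": "The cart shows exactly one item and a non-zero total price",
    "expected_verdict": "fail",  # this variant would run against a seeded defect
}
```

An ATA receives these instructions verbatim, as a human tester would, and must return "pass" on the healthy application and "fail" on the defective variant.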