GUITester: Enabling GUI Agents for Exploratory Defect Discovery

📅 2026-01-08

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing agents for exploratory GUI testing are hindered by two problems: goal-oriented masking, where agents prioritize task completion over reporting anomalies, and execution-bias attribution, where system defects are misidentified as agent errors. This work proposes GUITester, a multi-agent framework that decouples navigation from verification. A planning-execution module proactively probes for defects, and a hierarchical reflection module analyzes interaction histories to attribute errors precisely. The study also introduces GUITestBench, described as the first interactive benchmark designed for GUI defect detection. GUITester achieves an F1-score of 48.90% (Pass@3) on this benchmark, compared with 33.35% for the state-of-the-art baseline.
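
For readers unfamiliar with the reported metric, the sketch below computes the standard F1-score from true-positive, false-positive, and false-negative counts. The function name and the example counts are illustrative assumptions; how the paper aggregates repeated runs for Pass@3 is not stated in this summary and is not modeled here.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 = 2PR / (P + R), written in terms of raw counts.

    tp: defects correctly reported
    fp: reports that did not correspond to a real defect
    fn: real defects that were never reported
    Returns 0.0 when the score is undefined (no positives at all).
    """
    denom = 2 * tp + fp + fn
    if denom == 0:
        return 0.0
    return 2 * tp / denom


# Hypothetical counts, chosen only to exercise the formula.
print(f1_score(tp=1, fp=1, fn=1))
```

Because F1 is the harmonic mean of precision and recall, an agent that reports few defects but reports them accurately is still penalized for the ones it misses.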

📝 Abstract

Exploratory GUI testing is essential for software quality but suffers from high manual costs. While Multi-modal Large Language Model (MLLM) agents excel at navigation, they fail to autonomously discover defects due to two core challenges: Goal-Oriented Masking, where agents prioritize task completion over reporting anomalies, and Execution-Bias Attribution, where system defects are misidentified as agent errors. To address these, we first introduce GUITestBench, the first interactive benchmark for this task, featuring 143 tasks across 26 defects. We then propose GUITester, a multi-agent framework that decouples navigation from verification via two modules: (i) a Planning-Execution Module (PEM) that proactively probes for defects via embedded testing intents, and (ii) a Hierarchical Reflection Module (HRM) that resolves attribution ambiguity through interaction history analysis. GUITester achieves an F1-score of 48.90% (Pass@3) on GUITestBench, outperforming state-of-the-art baselines (33.35%). Our work demonstrates the feasibility of autonomous exploratory testing and provides a robust foundation for future GUI quality assurance. Our code is available at https://github.com/ADaM-BJTU/GUITestBench.
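
The idea of separating navigation from verification can be illustrated with a toy sketch. Everything here is a hypothetical simplification, not the paper's PEM or HRM: the `Step` record, the `action_valid` signal, and the single attribution rule are assumptions made only to show how a verifier that reads the interaction history can separate a suspected system defect from an agent error.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One entry in the interaction history (hypothetical schema)."""
    action: str         # what the navigator attempted
    expected: str       # screen state the navigator expected afterwards
    observed: str       # screen state actually seen
    action_valid: bool  # assumed signal: was the action itself well-formed?


def attribute(history: List[Step]) -> List[str]:
    """Label each step as ok, a suspected defect, or an agent error.

    Toy rule (an assumption): a mismatch after a well-formed action is
    blamed on the system, while a mismatch after a malformed action is
    blamed on the agent.
    """
    labels = []
    for step in history:
        if step.observed == step.expected:
            labels.append("ok")
        elif step.action_valid:
            labels.append("suspected_defect")
        else:
            labels.append("agent_error")
    return labels


history = [
    Step("tap Save", "saved", "saved", True),
    Step("tap Delete", "item removed", "item still shown", True),
    Step("tap missing button", "dialog", "no change", False),
]
print(attribute(history))
```

Keeping the verifier independent of the navigator's goal means a mismatch is recorded rather than silently retried, which is the failure mode the abstract calls Goal-Oriented Masking.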

Problem

Research questions and friction points this paper is trying to address.

Exploratory GUI testing

Defect discovery

Goal-Oriented Masking

Execution-Bias Attribution

Autonomous testing

Innovation

Methods, ideas, or system contributions that make the work stand out.

GUITester

exploratory testing

multi-agent framework