WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation

📅 2025-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing GUI agents localize interface elements well but are highly sensitive to the initial state of the environment (for example, an application that has not been launched or a UI that is not in its default configuration), which leads to frequent planning failures in real-world scenarios. Prevailing benchmarks do not systematically evaluate this kind of state uncertainty. To address this, we propose WorldGUI, the first desktop GUI dynamic testing benchmark explicitly designed for initial-state diversity, covering ten mainstream applications including PowerPoint and VSCode. We further introduce GUI-Thinker, a novel framework featuring cross-application state modeling, multi-stage reasoning, and a critical reflection mechanism to enhance planning robustness. Evaluated on WorldGUI, GUI-Thinker achieves an 82.3% task success rate, outperforming Claude-3.5 (Computer Use) by 14.9%, thereby significantly mitigating the environment-sensitivity bottleneck in GUI agent deployment.

📝 Abstract
Current GUI agents have achieved outstanding performance in GUI element grounding. However, planning remains highly challenging, especially because agents are sensitive to the initial state of the environment. Specifically, slight differences in the initial state, such as the target software not being open or the interface not being in its default state, often lead to planning errors. This issue is widespread in real user scenarios, but existing benchmarks fail to evaluate it. In this paper, we present WorldGUI, a novel GUI benchmark that designs GUI tasks with various initial states to simulate real computer-user interactions. The benchmark spans a wide range of tasks across 10 popular software applications, including PowerPoint, VSCode, and Adobe Acrobat. In addition, to address the challenges of dynamic GUI automation tasks, we propose GUI-Thinker, a holistic framework that leverages a critique mechanism to manage the unpredictability and complexity of GUI interactions. Experimental results demonstrate that GUI-Thinker significantly outperforms Claude-3.5 (Computer Use) by 14.9% in success rate on WorldGUI tasks. This improvement underscores the effectiveness of our critical-thinking-based framework in enhancing GUI automation.
Problem

Research questions and friction points this paper is trying to address.

Addresses GUI planning errors
Simulates real user interactions
Enhances dynamic GUI automation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic initial state simulation
Holistic GUI automation framework
Critique mechanism for unpredictability
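
The two ideas above (running the same task from varied initial states, and checking the environment before acting) can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not GUI-Thinker's implementation or the WorldGUI harness: `DesktopState`, `INITIAL_STATES`, `naive_agent`, and `critiquing_agent` are hypothetical names, and the "environment" is a three-field dataclass rather than a real desktop.

```python
from dataclasses import dataclass


@dataclass
class DesktopState:
    """Toy stand-in for a desktop environment (hypothetical)."""
    app_open: bool
    active_tab: str
    bold_applied: bool = False


# One task ("make the selected text bold") started from three
# hypothetical initial states, mirroring the idea of dynamic testing.
INITIAL_STATES = {
    "default": lambda: DesktopState(app_open=True, active_tab="Home"),
    "app_closed": lambda: DesktopState(app_open=False, active_tab="Home"),
    "wrong_tab": lambda: DesktopState(app_open=True, active_tab="Insert"),
}


def click_bold(state: DesktopState) -> bool:
    """The click only lands if the app is open and the Home tab is showing."""
    if state.app_open and state.active_tab == "Home":
        state.bold_applied = True
        return True
    return False


def naive_agent(state: DesktopState) -> bool:
    """Executes a fixed plan that assumes the default initial state."""
    click_bold(state)
    return state.bold_applied


def critiquing_agent(state: DesktopState, max_steps: int = 5) -> bool:
    """Checks preconditions before each step and repairs the state first."""
    for _ in range(max_steps):
        if state.bold_applied:
            return True
        if not state.app_open:
            state.app_open = True        # corrective action: launch the app
        elif state.active_tab != "Home":
            state.active_tab = "Home"    # corrective action: switch tabs
        else:
            click_bold(state)            # preconditions hold; run planned step
    return state.bold_applied


def success_rate(agent) -> float:
    """Fraction of initial states from which the agent completes the task."""
    results = [agent(make_state()) for make_state in INITIAL_STATES.values()]
    return sum(results) / len(results)


if __name__ == "__main__":
    print("naive:", success_rate(naive_agent))
    print("critiquing:", success_rate(critiquing_agent))
```

In this toy setting the fixed plan succeeds only from the default state, while the agent that inspects the state before each step recovers from the other two. The numbers it prints describe the simulation only and say nothing about the paper's reported results.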