GUI Agents: A Survey

📅 2024-12-18

🏛️ arXiv.org

📈 Citations: 28

✨ Influential: 2

🤖 AI Summary

Problem: Automated human-computer interaction via GUI agents remains challenging due to fragmented evaluation criteria, heterogeneous architectures, and insufficiently characterized capabilities of large-model-driven agents. Method: We present the first unified capability framework for GUI agents, encompassing multimodal perception (OCR/VLM), neuro-symbolic reasoning, hierarchical task planning, and end-to-end reinforcement fine-tuning. We systematically classify and critically evaluate 15+ benchmarks, 30+ representative works, and eight architectural paradigms. Contribution/Results: Our work establishes a comprehensive technical landscape, introducing a reusable capability benchmarking paradigm, standardized evaluation protocols, and a forward-looking roadmap. We explicitly identify six open challenges (robustness, generalization, compositional reasoning, efficiency, explainability, and real-world deployment) and outline concrete future research directions. This synthesis bridges theoretical foundations with practical engineering insights, advancing the systematic development of intelligent GUI agents.

📝 Abstract

Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. These agents autonomously interact with digital systems or software applications via GUIs, emulating human actions such as clicking, typing, and navigating visual elements across diverse platforms. Motivated by the growing interest and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods. We propose a unified framework that delineates their perception, reasoning, planning, and acting capabilities. Furthermore, we identify important open challenges and discuss key future directions. Finally, this work serves as a basis for practitioners and researchers to gain an intuitive understanding of current progress, techniques, benchmarks, and critical open problems that remain to be addressed.
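The perception, planning, and acting stages named in the abstract can be illustrated with a toy loop. This is a minimal sketch, not the survey's framework: `ToyLoginScreen`, `Element`, `Action`, `perceive`, `plan`, and `run_agent` are all hypothetical names, and a rule-based planner stands in for the reasoning a Large Foundation Model would provide.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Element:
    """A UI element as a perception module might report it (hypothetical schema)."""
    element_id: str
    role: str   # e.g. "textbox" or "button"
    label: str


@dataclass(frozen=True)
class Action:
    """A GUI action the agent can emit (hypothetical schema)."""
    kind: str       # "type" or "click"
    target: str     # element_id of the element acted on
    text: str = ""  # only used by "type" actions


class ToyLoginScreen:
    """Stand-in environment; a real agent would drive a browser or an OS."""

    def __init__(self) -> None:
        self.fields: Dict[str, str] = {"username": "", "password": ""}
        self.logged_in = False

    def observe(self) -> List[Element]:
        return [
            Element("username", "textbox", "Username"),
            Element("password", "textbox", "Password"),
            Element("submit", "button", "Sign in"),
        ]

    def step(self, action: Action) -> None:
        if action.kind == "type" and action.target in self.fields:
            self.fields[action.target] = action.text
        elif action.kind == "click" and action.target == "submit":
            # Login succeeds only if every field has been filled in.
            self.logged_in = all(self.fields.values())


def perceive(env: ToyLoginScreen) -> List[Element]:
    """Perception: turn the screen into structured elements."""
    return env.observe()


def plan(goal: Dict[str, str], elements: List[Element]) -> List[Action]:
    """Planning (rule-based placeholder for a model): fill each textbox
    named in the goal, then click the first button on the screen."""
    actions = [
        Action("type", e.element_id, goal[e.element_id])
        for e in elements
        if e.role == "textbox" and e.element_id in goal
    ]
    buttons = [e for e in elements if e.role == "button"]
    if buttons:
        actions.append(Action("click", buttons[0].element_id))
    return actions


def run_agent(env: ToyLoginScreen, goal: Dict[str, str]) -> List[Action]:
    """Acting: execute the planned actions and return the trace."""
    trace = []
    for action in plan(goal, perceive(env)):
        env.step(action)
        trace.append(action)
    return trace


if __name__ == "__main__":
    env = ToyLoginScreen()
    trace = run_agent(env, {"username": "ada", "password": "s3cret"})
    print([a.kind for a in trace], env.logged_in)
```

The sketch plans once from a single observation. An agent operating on real interfaces would more plausibly re-perceive after each action, since the screen can change in ways the plan did not anticipate.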

Problem

Research questions and friction points this paper is trying to address.

Surveying GUI agents' benchmarks, metrics, architectures, and training methods

Proposing a unified framework for perception, reasoning, planning, and acting capabilities

Identifying open challenges and future directions for autonomous human-computer interaction

Innovation

Methods, ideas, or system contributions that make the work stand out.

GUI agents powered by Large Foundation Models

Unified framework for perception, reasoning, planning, and acting

Automating human-computer interaction across diverse platforms

🔎 Similar Papers

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents