GUI Agents: A Survey

📅 2024-12-18
🏛️ arXiv.org
📈 Citations: 28
Influential: 2
🤖 AI Summary
Problem: Automated human-computer interaction via GUI agents remains challenging due to fragmented evaluation criteria, heterogeneous architectures, and insufficiently characterized capabilities of large-model-driven agents. Method: We present a unified capability framework for GUI agents, encompassing multimodal perception (OCR/VLM), neuro-symbolic reasoning, hierarchical task planning, and end-to-end reinforcement fine-tuning. We systematically classify and critically evaluate 15+ benchmarks, 30+ representative works, and eight architectural paradigms. Contribution/Results: Our work establishes a comprehensive technical landscape, introducing a reusable capability benchmarking paradigm, standardized evaluation protocols, and a forward-looking roadmap. We explicitly identify six open challenges (robustness, generalization, compositional reasoning, efficiency, explainability, and real-world deployment) and outline concrete future research directions. This synthesis bridges theoretical foundations with practical engineering insights, advancing the systematic development of intelligent GUI agents.

📝 Abstract
Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. These agents autonomously interact with digital systems or software applications via GUIs, emulating human actions such as clicking, typing, and navigating visual elements across diverse platforms. Motivated by the growing interest and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods. We propose a unified framework that delineates their perception, reasoning, planning, and acting capabilities. Furthermore, we identify important open challenges and discuss key future directions. Finally, this work serves as a basis for practitioners and researchers to gain an intuitive understanding of current progress, techniques, benchmarks, and critical open problems that remain to be addressed.
Problem

Research questions and friction points this paper is trying to address.

Surveying GUI agents' benchmarks, metrics, architectures, and training methods
Proposing a unified framework for perception, reasoning, planning, and acting capabilities
Identifying open challenges and future directions for autonomous human-computer interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

GUI agents powered by Large Foundation Models
Unified framework for perception, reasoning, planning, and acting
Automating human-computer interaction across diverse platforms
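
The perception, reasoning, planning, and acting decomposition named above can be pictured as a simple control loop. The sketch below is only an illustration under stated assumptions, not the survey's framework or any specific system: `ToyGUI`, `Element`, `perceive`, `plan`, `act`, and `run_agent` are hypothetical names, and a hand-written rule stands in for the foundation model that would normally do the reasoning and planning.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Element:
    """Hypothetical on-screen widget: a text box or a button."""
    name: str
    kind: str  # "textbox" or "button"
    text: str = ""


@dataclass
class ToyGUI:
    """Stand-in for a real application (an assumption for illustration)."""
    elements: dict = field(default_factory=dict)
    submitted: Optional[str] = None

    def type_text(self, name: str, text: str) -> None:
        self.elements[name].text = text

    def click(self, name: str) -> None:
        # Clicking "submit" sends whatever is currently in the query box.
        if name == "submit":
            self.submitted = self.elements["query"].text


def perceive(gui):
    """Perception: turn the screen into a structured observation."""
    return [(e.name, e.kind, e.text) for e in gui.elements.values()]


def plan(goal_text, observation):
    """Reasoning and planning: a rule-based placeholder for a model call."""
    state = {name: text for name, _, text in observation}
    if state.get("query") != goal_text:
        return [("type", "query", goal_text), ("click", "submit", "")]
    return [("click", "submit", "")]


def act(gui, action):
    """Acting: emulate a human click or keystroke on the GUI."""
    verb, target, payload = action
    if verb == "type":
        gui.type_text(target, payload)
    elif verb == "click":
        gui.click(target)
    else:
        raise ValueError(f"unknown action: {verb}")


def run_agent(gui, goal_text, max_steps=5):
    """Observe, plan, execute the first planned step, then re-plan."""
    trace = []
    for _ in range(max_steps):
        if gui.submitted == goal_text:
            break
        action = plan(goal_text, perceive(gui))[0]
        act(gui, action)
        trace.append(action)
    return trace


gui = ToyGUI(elements={
    "query": Element("query", "textbox"),
    "submit": Element("submit", "button"),
})
trace = run_agent(gui, "weather")
print(trace, gui.submitted)
```

Re-planning after every action, rather than executing the whole plan blindly, is one common design choice because the screen can change between steps; a real agent would replace `perceive` with screenshot or accessibility-tree parsing and `plan` with a model call.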
Authors

Dang Nguyen
University of Maryland

Jian Chen
State University of New York at Buffalo

Yu Wang
University of Oregon

Gang Wu
Adobe Research

Namyong Park
Meta AI
Machine Learning, Representation Learning, Graph Learning, Knowledge Reasoning, Complex Networks

Zhengmian Hu
Adobe Research
Deep Learning, Monte Carlo

Hanjia Lyu
University of Rochester
AI and Society, Multimodal LLMs, Graph Learning, Computational Social Science, Health Informatics

Junda Wu
University of California, San Diego
Natural Language Processing, Recommender System, Multimodal Learning, Reinforcement Learning

Ryan Aponte
Carnegie Mellon University

Yu Xia
University of California, San Diego

Xintong Li
University of California, San Diego

Jing Shi
Adobe Research

Hongjie Chen
Dolby Labs

Viet Dac Lai
Adobe Research

Zhouhang Xie
University of California, San Diego
natural language processing, machine learning, recommender systems

Sungchul Kim
Adobe
Data mining, Machine learning, Bioinformatics

Ruiyi Zhang
Adobe Research

Tong Yu
Adobe Research

Mehrab Tanjim
Adobe Research

Nesreen K. Ahmed
Senior Principal Scientist, Cisco AI Research, Intel Labs, Purdue University
Geometric Deep Learning, Graph Representation Learning, ML for Systems, ML4code

Puneet Mathur
Adobe Research

Seunghyun Yoon
Assistant Professor, Korea Institute of Energy Technology (KENTECH)
Reinforcement Learning, Deep Learning, Data Science, Networking, Cyber Security

Lina Yao
Science Lead at CSIRO Data61 & Professor at University of New South Wales, Australia
Machine Learning, Reinforcement Learning, Recommender Systems, LLM Agent, Brain Computer Interface

Branislav Kveton
Adobe Research
Artificial Intelligence, Machine Learning

Thien Huu Nguyen
University of Oregon
Information Extraction, Deep Learning, Natural Language Processing, Machine Learning

Trung Bui
Adobe Research

Tianyi Zhou
University of Maryland

Ryan A. Rossi
Adobe Research
Machine Learning, Personalization, Graph Representation Learning, Graph ML, Graph Theory

Franck Dernoncourt
NLP/ML Researcher. MIT PhD.
Machine Learning, Neural Networks, Natural Language Processing