🤖 AI Summary
Problem: Automated human-computer interaction via GUI agents remains challenging due to fragmented evaluation criteria, heterogeneous architectures, and insufficiently characterized capabilities of large-model-driven agents.
Method: We present the first unified capability framework for GUI agents, encompassing multimodal perception (OCR/VLM), neuro-symbolic reasoning, hierarchical task planning, and end-to-end reinforcement fine-tuning. We systematically classify and critically evaluate 15+ benchmarks, 30+ representative works, and eight architectural paradigms.
Contribution/Results: Our work maps the technical landscape of GUI agents and introduces a reusable capability benchmarking paradigm, standardized evaluation protocols, and a forward-looking roadmap. We identify six open challenges (robustness, generalization, compositional reasoning, efficiency, explainability, and real-world deployment) and outline concrete future research directions. This synthesis connects theoretical foundations with practical engineering insights to support the systematic development of intelligent GUI agents.
📝 Abstract
Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. These agents autonomously interact with digital systems or software applications via GUIs, emulating human actions such as clicking, typing, and navigating visual elements across diverse platforms. Motivated by the growing interest in GUI agents and their fundamental importance, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods. We propose a unified framework that delineates their perception, reasoning, planning, and acting capabilities. Furthermore, we identify important open challenges and discuss key future directions. Finally, this work serves as a basis for practitioners and researchers to gain an intuitive understanding of current progress, techniques, benchmarks, and critical open problems that remain to be addressed.