LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects

📅 2025-04-28
📈 Citations: 2
Influential: 0
🤖 AI Summary
This paper surveys LLM-driven GUI agents for phone automation, motivated by three bottlenecks in existing systems: limited generality, high maintenance overhead, and weak intent comprehension. It organizes the field into a taxonomy covering agent frameworks (single-agent, multi-agent, plan-then-act), modeling approaches (prompt engineering and training-based methods), and datasets and benchmarks. It also details the task-specific architectures, supervised fine-tuning, and reinforcement learning strategies that bridge user intent and GUI operations. Open challenges discussed include dataset diversity, on-device deployment efficiency, user-centric adaptation, and security. The survey is intended as a reference for researchers and practitioners building scalable, user-friendly LLM-driven phone GUI agents.

📝 Abstract
With the rapid rise of large language models (LLMs), phone automation has undergone transformative changes. This paper systematically reviews LLM-driven phone GUI agents, highlighting their evolution from script-based automation to intelligent, adaptive systems. We first contextualize key challenges, (i) limited generality, (ii) high maintenance overhead, and (iii) weak intent comprehension, and show how LLMs address these issues through advanced language understanding, multimodal perception, and robust decision-making. We then propose a taxonomy covering fundamental agent frameworks (single-agent, multi-agent, plan-then-act), modeling approaches (prompt engineering, training-based), and essential datasets and benchmarks. Furthermore, we detail task-specific architectures, supervised fine-tuning, and reinforcement learning strategies that bridge user intent and GUI operations. Finally, we discuss open challenges such as dataset diversity, on-device deployment efficiency, user-centric adaptation, and security concerns, offering forward-looking insights into this rapidly evolving field. By providing a structured overview and identifying pressing research gaps, this paper serves as a definitive reference for researchers and practitioners seeking to harness LLMs in designing scalable, user-friendly phone GUI agents.
Problem

Research questions and friction points this paper is trying to address.

Addressing limited generality in phone automation
Reducing high maintenance overhead in GUI agents
Improving weak intent comprehension via LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-driven adaptive phone GUI agents
Multimodal perception for intent comprehension
Reinforcement learning for GUI operations
Authors

Guangyi Liu, Zhejiang University
Pengxiang Zhao, Zhejiang University (LLM, AI Agent)
Liang Liu, vivo AI Lab
Yaxuan Guo, vivo AI Lab (UI Agent, Mobile Agent)
Han Xiao, CUHK MMLab
Weifeng Lin, The Chinese University of Hong Kong (Deep Learning, Computer Vision)
Yuxiang Chai, The Chinese University of Hong Kong (Computer Vision, LLM, Agent)
Yue Han, Zhejiang University
Shuai Ren, vivo AI Lab
Hao Wang, Zhejiang University
Xiaoyu Liang, Tsinghua University (CO2 Conversion)
Wenhao Wang, Zhejiang University
Tianze Wu, Zhejiang University
Linghao Li, Zhejiang University
Guanjing Xiong, vivo AI Lab
Yong Liu, Zhejiang University
Hongsheng Li, CUHK MMLab