Building a Stable Planner: An Extended Finite State Machine Based Planning Module for Mobile GUI Agent

📅 2025-05-20
🤖 AI Summary
Mobile GUI agents frequently fail during task execution because they lack sufficient understanding of application-level semantics. To address this, the paper proposes SPlanner, a stable task planning module designed for mobile GUI agents. The approach has three parts: (1) extended finite state machines (EFSMs) model application control logic, giving an interpretable, application-level semantic model; (2) a plug-and-play planning architecture separates the LLM-generated natural-language plan from the GUI interactions executed by a VLM, so the planner can be paired with different vision-language models; (3) user instructions are decomposed into sequences of primary functions modeled in the EFSMs, from which an execution path is derived. On the AndroidWorld benchmark, SPlanner paired with Qwen2.5-VL-72B achieves a 63.8% task success rate, a 28.8 percentage-point improvement over the same model without planning assistance.

📝 Abstract
Mobile GUI agents execute user commands by directly interacting with the graphical user interface (GUI) of mobile devices, demonstrating significant potential to enhance user convenience. However, these agents face considerable challenges in task planning, as they must continuously analyze the GUI and generate operation instructions step by step. This process often makes accurate task planning difficult, because GUI agents lack a deep understanding of how to use the target applications effectively, which can cause them to become "lost" during task execution. To address the task planning issue, we propose SPlanner, a plug-and-play planning module that generates execution plans to guide vision language models (VLMs) in executing tasks. The proposed planning module uses extended finite state machines (EFSMs) to model the control logic and configurations of mobile applications. It then decomposes a user instruction into a sequence of primary functions modeled in the EFSMs, and generates the execution path by traversing the EFSMs. We further refine the execution path into a natural language plan using an LLM. The final plan is concise and actionable, and effectively guides VLMs to generate interactive GUI actions that accomplish user tasks. SPlanner demonstrates strong performance on dynamic benchmarks reflecting real-world mobile usage. On the AndroidWorld benchmark, SPlanner achieves a 63.8% task success rate when paired with Qwen2.5-VL-72B as the VLM executor, a 28.8 percentage point improvement over using Qwen2.5-VL-72B without planning assistance.
Problem

Research questions and friction points this paper is trying to address.

Mobile GUI agents struggle with accurate task planning due to limited app understanding.
Existing methods lack effective modeling of mobile app control logic.
Vision language models need better guidance for executing GUI actions.
Innovation

Methods, ideas, or system contributions that make the work stand out.

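Uses extended finite state machines (EFSMs) to model the control logic and configurations of mobile applications.
Decomposes user instructions into sequences of primary functions modeled in the EFSMs.
Generates an execution path by traversing the EFSMs, then refines it into a natural language plan with an LLM.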
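The planning idea above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Transition` class, the `find_path` and `render_plan` helpers, and the toy Wi-Fi settings app are all hypothetical names chosen for illustration. The sketch assumes an EFSM is a set of transitions whose guards and updates operate on a configuration dictionary, searches for an execution path breadth-first, and uses a plain numbered-list formatter as a stand-in for the LLM refinement step.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Config = Dict[str, bool]  # app configuration variables (assumed boolean here)


@dataclass(frozen=True)
class Transition:
    source: str                                       # GUI state before the action
    target: str                                       # GUI state after the action
    action: str                                       # human-readable GUI action
    guard: Callable[[Config], bool] = lambda cfg: True    # enabled only if true
    update: Callable[[Config], Config] = lambda cfg: cfg  # effect on the config


def find_path(transitions: List[Transition], start: str, config: Config,
              goal: Callable[[str, Config], bool]) -> Optional[List[str]]:
    """Breadth-first search over (state, config) pairs; returns action labels."""
    queue = deque([(start, dict(config), [])])
    seen = {(start, tuple(sorted(config.items())))}
    while queue:
        state, cfg, path = queue.popleft()
        if goal(state, cfg):
            return path
        for t in transitions:
            if t.source == state and t.guard(cfg):
                new_cfg = t.update(dict(cfg))
                key = (t.target, tuple(sorted(new_cfg.items())))
                if key not in seen:
                    seen.add(key)
                    queue.append((t.target, new_cfg, path + [t.action]))
    return None  # goal unreachable in this model


def render_plan(path: List[str]) -> str:
    """Stand-in for the LLM refinement step: number the actions."""
    return "\n".join(f"{i}. {step}" for i, step in enumerate(path, start=1))


# Hypothetical toy app: turning Wi-Fi on from the home screen.
TRANSITIONS = [
    Transition("home", "settings", "open Settings"),
    Transition("settings", "home", "press back"),
    Transition("settings", "wifi_page", "tap Wi-Fi"),
    Transition("wifi_page", "wifi_page", "toggle Wi-Fi switch",
               guard=lambda cfg: not cfg["wifi_on"],
               update=lambda cfg: {**cfg, "wifi_on": True}),
]


def wifi_enabled(state: str, cfg: Config) -> bool:
    return state == "wifi_page" and cfg["wifi_on"]


path = find_path(TRANSITIONS, "home", {"wifi_on": False}, wifi_enabled)
plan = render_plan(path)
```

Because the guard on the toggle transition reads the configuration, the same model yields a shorter path when Wi-Fi is already on. This is the kind of configuration-dependent behavior that distinguishes an EFSM from a plain finite state machine.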
Fanglin Mo
School of Computer Science & Engineering, South China University of Technology
Junzhe Chen
Tsinghua University
Natural Language Processing
Haoxuan Zhu
School of Computer Science & Engineering, South China University of Technology
Xuming Hu
Assistant Professor, HKUST(GZ) / HKUST
Natural Language Processing, Large Language Model