🤖 AI Summary
Mobile GUI agents frequently fail during task execution due to insufficient understanding of application-level semantics. To address this, we propose SPlanner, the first stable task planning module specifically designed for mobile GUI agents. Our approach comprises three key contributions: (1) We introduce Extended Finite State Machines (EFSMs) to model application control logic, enabling interpretable and formally verifiable semantic modeling at the application level; (2) We design a plug-and-play planning architecture that decouples LLM-generated natural-language action plans from VLM-executed GUI interactions, supporting seamless integration of arbitrary vision-language models; (3) We integrate GUI state parsing with functional abstraction to decompose user instructions into executable functional sequences. Evaluated on the AndroidWorld benchmark, SPlanner paired with Qwen2.5-VL-72B achieves a 63.8% task success rate, a 28.8 percentage-point improvement over the same model without planning assistance, and significantly enhances planning stability and generalization in dynamic environments.
📝 Abstract
Mobile GUI agents execute user commands by directly interacting with the graphical user interface (GUI) of mobile devices, demonstrating significant potential to enhance user convenience. However, these agents face considerable challenges in task planning, as they must continuously analyze the GUI and generate operation instructions step by step. Accurate planning is difficult because GUI agents lack a deep understanding of how to use the target applications effectively, which can cause them to become "lost" during task execution. To address the task planning issue, we propose SPlanner, a plug-and-play planning module that generates execution plans to guide vision-language models (VLMs) in executing tasks. The proposed planning module uses extended finite state machines (EFSMs) to model the control logic and configurations of mobile applications. It then decomposes a user instruction into a sequence of primary functions modeled in the EFSMs and generates the execution path by traversing the EFSMs. We further refine the execution path into a natural-language plan using an LLM. The final plan is concise and actionable, and effectively guides VLMs to generate interactive GUI actions to accomplish user tasks. SPlanner demonstrates strong performance on dynamic benchmarks reflecting real-world mobile usage. On the AndroidWorld benchmark, SPlanner achieves a 63.8% task success rate when paired with Qwen2.5-VL-72B as the VLM executor, yielding a 28.8 percentage-point improvement compared to using Qwen2.5-VL-72B without planning assistance.
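To make the planning idea concrete, the following is a minimal sketch of traversing an EFSM to obtain an execution path for a sequence of primary functions. It is an illustration under stated assumptions, not SPlanner's implementation: the notes app, its states, the `dark_mode` variable, the action labels, and the names `Transition`, `find_path`, and `plan_task` are all hypothetical, and breadth-first search is one possible traversal strategy chosen here for simplicity. The LLM refinement step is omitted.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Vars = Dict[str, object]  # EFSM variables (the app "configuration")


def _always(v: Vars) -> bool:
    return True


def _unchanged(v: Vars) -> Vars:
    return v


@dataclass(frozen=True)
class Transition:
    source: str                               # GUI state before the action
    target: str                               # GUI state after the action
    action: str                               # GUI operation label
    guard: Callable[[Vars], bool] = _always   # enabled only if guard holds
    update: Callable[[Vars], Vars] = _unchanged  # effect on the variables


def find_path(transitions: List[Transition], start: str, variables: Vars,
              goal: Callable[[str, Vars], bool]
              ) -> Optional[Tuple[List[str], str, Vars]]:
    """Breadth-first search over (state, variables) pairs."""
    def freeze(v: Vars):
        return tuple(sorted(v.items()))

    queue = deque([(start, dict(variables), [])])
    seen = {(start, freeze(variables))}
    while queue:
        state, current, path = queue.popleft()
        if goal(state, current):
            return path, state, current
        for t in transitions:
            if t.source != state or not t.guard(current):
                continue
            new_vars = t.update(dict(current))
            key = (t.target, freeze(new_vars))
            if key not in seen:
                seen.add(key)
                queue.append((t.target, new_vars, path + [t.action]))
    return None


def plan_task(transitions, start, variables, functions, goals) -> List[str]:
    """Chain one path per primary function, carrying state and variables over."""
    plan: List[str] = []
    state, current = start, dict(variables)
    for name in functions:
        result = find_path(transitions, state, current, goals[name])
        if result is None:
            raise ValueError(f"no path for function {name!r}")
        path, state, current = result
        plan.extend(path)
    return plan


# Hypothetical EFSM for an imaginary notes app (assumption, not from the paper).
TRANSITIONS = [
    Transition("Home", "NoteList", "tap 'Notes'"),
    Transition("Home", "Settings", "tap 'Settings'"),
    Transition("NoteList", "Editor", "tap '+' button"),
    Transition("NoteList", "Home", "tap back"),
    Transition("Settings", "Home", "tap back"),
    Transition("Settings", "Settings", "toggle 'Dark mode'",
               guard=lambda v: not v["dark_mode"],
               update=lambda v: {**v, "dark_mode": True}),
]

# Each primary function is identified by a goal predicate over (state, variables).
GOALS = {
    "enable_dark_mode": lambda s, v: v["dark_mode"] is True,
    "create_note": lambda s, v: s == "Editor",
}

plan = plan_task(TRANSITIONS, "Home", {"dark_mode": False},
                 ["enable_dark_mode", "create_note"], GOALS)
for step_number, step in enumerate(plan, start=1):
    print(step_number, step)
```

In this sketch the search runs over pairs of GUI state and variable values, which is what distinguishes an extended FSM from a plain one: the guard on the dark-mode toggle depends on the configuration, not only on the current screen. The resulting list of action labels plays the role of the execution path that would then be rewritten into a natural-language plan.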