MAPLE: A Mobile Assistant with Persistent Finite State Machines for Recovery Reasoning

📅 2025-05-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing mobile GUI agents respond solely to the current screen, lacking structured modeling of app navigation flows, resulting in weak contextual understanding and insufficient error detection and recovery capabilities. To address this, we propose a state-aware multi-agent mobile GUI assistant that introduces, for the first time, a lightweight, model-agnostic, persistent finite-state machine (FSM) as a state memory layer. This FSM dynamically models app navigation in real time, enabling closed-loop collaboration across task planning, execution, result verification, and rollback-based error recovery. Our approach integrates a multi-agent architecture, multimodal large language models (MLLMs), UI screen perception, and state-action mapping learning. Evaluated on Mobile-Eval-E and SPA-Bench, our method achieves +12.0% task success rate, +13.8% error recovery success rate, and +6.5% action accuracy over prior approaches.

๐Ÿ“ Abstract
Mobile GUI agents aim to autonomously complete user-instructed tasks across mobile apps. Recent advances in Multimodal Large Language Models (MLLMs) enable these agents to interpret UI screens, identify actionable elements, and perform interactions such as tapping or typing. However, existing agents remain reactive: they reason only over the current screen and lack a structured model of app navigation flow, limiting their ability to understand context, detect unexpected outcomes, and recover from errors. We present MAPLE, a state-aware multi-agent framework that abstracts app interactions as a Finite State Machine (FSM). We computationally model each UI screen as a discrete state and user actions as transitions, allowing the FSM to provide a structured representation of the app execution. MAPLE consists of specialized agents responsible for five phases of task execution: planning, execution, verification, error recovery, and knowledge retention. These agents collaborate to dynamically construct FSMs in real time based on perception data extracted from the UI screen, allowing the GUI agents to track navigation progress and flow, validate action outcomes through pre- and post-conditions of the states, and recover from errors by rolling back to previously stable states. Our evaluation results on two challenging cross-app benchmarks, Mobile-Eval-E and SPA-Bench, show that MAPLE outperforms the state-of-the-art baseline, improving task success rate by up to 12%, recovery success by 13.8%, and action accuracy by 6.5%. Our results highlight the importance of structured state modeling in guiding mobile GUI agents during task execution. Moreover, our FSM representation can be integrated into future GUI agent architectures as a lightweight, model-agnostic memory layer to support structured planning, execution verification, and error recovery.
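The abstract's core idea, screens as FSM states, actions as transitions, with post-condition checks and rollback to stable states, can be sketched in a few lines. The following is a minimal illustrative sketch, not the paper's implementation; all names (`ScreenState`, `FSMMemory`, etc.) are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ScreenState:
    """A discrete FSM state: one UI screen (id could be a layout hash)."""
    screen_id: str
    precondition: str = ""   # what must hold before entering this screen
    postcondition: str = ""  # what should hold after acting on it

@dataclass
class FSMMemory:
    """Persistent state memory layer: observed transitions plus a history
    of stable states to roll back to on error."""
    transitions: dict = field(default_factory=dict)  # (state_id, action) -> ScreenState
    history: list = field(default_factory=list)      # previously stable states

    def record(self, src: ScreenState, action: str, dst: ScreenState) -> None:
        """Add an observed transition and remember the source as stable."""
        self.transitions[(src.screen_id, action)] = dst
        self.history.append(src)

    def expected_next(self, src: ScreenState, action: str):
        """Expected destination for an action, or None if never observed."""
        return self.transitions.get((src.screen_id, action))

    def verify(self, src: ScreenState, action: str, observed: ScreenState) -> bool:
        """Verification phase: did the action land where the FSM predicts?
        Unknown transitions pass (nothing to contradict)."""
        expected = self.expected_next(src, action)
        return expected is None or expected.screen_id == observed.screen_id

    def rollback_target(self) -> ScreenState:
        """Error recovery: most recent stable state to navigate back to."""
        return self.history[-1]

# Usage: record a transition, verify an outcome, and pick a rollback target.
home = ScreenState("home")
inbox = ScreenState("inbox", precondition="logged in")
fsm = FSMMemory()
fsm.record(home, "tap_inbox", inbox)
assert fsm.verify(home, "tap_inbox", inbox)      # expected outcome
assert not fsm.verify(home, "tap_inbox", home)   # unexpected: trigger recovery
assert fsm.rollback_target().screen_id == "home"
```

In the paper's framing, separate agents would populate, query, and act on such a structure across the planning, execution, verification, and recovery phases; this sketch only shows the shared memory layer they collaborate through.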
Problem

Research questions and friction points this paper is trying to address.

Mobile GUI agents lack structured app navigation flow modeling
Existing agents fail to detect errors and recover effectively
Need for state-aware framework to improve task success rates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Finite State Machine for app navigation
Specialized agents for task execution phases
Dynamic FSM construction from UI perception
Linqiang Guo, Concordia University
Wei Liu, Software PErformance, Analysis, and Reliability (SPEAR) lab, Concordia University, Montreal, Quebec, Canada
Yi Wen Heng, Concordia University
T. Chen, Software PErformance, Analysis, and Reliability (SPEAR) lab, Concordia University, Montreal, Quebec, Canada
Yang Wang, Concordia University, Montreal, Quebec, Canada