Safeguarding Mobile GUI Agent via Logic-based Action Verification

📅 2025-03-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Mobile GUI agents driven by large foundation models (LFMs) can take actions that deviate from user intent, because LFM outputs are probabilistic and task specifications are often ambiguous. Method: This paper introduces VeriSafe Agent (VSA), which the authors describe as the first framework to bring formal verification to mobile GUI agents. VSA automatically translates natural-language instructions into verifiable specifications written in a domain-specific language (DSL), then checks every action against that specification before it runs, using a rule-driven runtime verification engine coupled with logical constraint solving. Contribution/Results: Combining a commercial LFM (GPT-4o) with a custom-built verification infrastructure, VSA achieves 94.3%–98.33% accuracy in verifying agent actions on 300 user instructions across 18 widely used mobile apps. This is a 20.4%–25.6% improvement over existing LLM-based verification methods, and it raises the GUI agent's task completion rate by 90%–130%.

📝 Abstract

Large Foundation Models (LFMs) have unlocked new possibilities in human-computer interaction, particularly with the rise of mobile Graphical User Interface (GUI) Agents capable of interpreting GUIs. These agents promise to revolutionize mobile computing by allowing users to automate complex mobile tasks through simple natural language instructions. However, the inherent probabilistic nature of LFMs, coupled with the ambiguity and context-dependence of mobile tasks, makes LFM-based automation unreliable and prone to errors. To address this critical challenge, we introduce VeriSafe Agent (VSA): a formal verification system that serves as a logically grounded safeguard for Mobile GUI Agents. VSA is designed to deterministically ensure that an agent's actions strictly align with user intent before they are executed. At its core, VSA introduces a novel autoformalization technique that translates natural language user instructions into a formally verifiable specification, expressed in our domain-specific language (DSL). This enables runtime, rule-based verification, allowing VSA to detect and prevent erroneous actions before they are executed, either by providing corrective feedback or by halting unsafe behavior. To the best of our knowledge, VSA is the first attempt to bring the rigor of formal verification to GUI agents, effectively bridging the gap between LFM-driven automation and formal software verification. We implement VSA using off-the-shelf LLM services (GPT-4o) and evaluate its performance on 300 user instructions across 18 widely used mobile apps. The results demonstrate that VSA achieves 94.3%-98.33% accuracy in verifying agent actions, representing a significant 20.4%-25.6% improvement over existing LLM-based verification methods, and consequently increases the GUI agent's task completion rate by 90%-130%.
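
The abstract does not show VSA's DSL or its rule set, so the following is only a minimal sketch of the general idea of checking a proposed action against a specification before it runs. Every name here (`Predicate`, `Spec`, `Action`, `verify`) is hypothetical, and the specification is written by hand rather than produced by autoformalization. The two rules are illustrative assumptions, not the paper's rules.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """One condition the task must satisfy, e.g. recipient == 'Alice' (hypothetical)."""
    name: str
    expected: str


@dataclass(frozen=True)
class Spec:
    """Toy stand-in for a specification compiled from a user instruction."""
    predicates: tuple[Predicate, ...]
    critical_actions: frozenset[str]  # actions that commit the task, e.g. "send"


@dataclass(frozen=True)
class Action:
    kind: str         # "input" or "click"
    target: str       # field or button name
    value: str = ""   # text typed, for "input" actions


def verify(spec: Spec, state: dict[str, str], action: Action) -> tuple[bool, str]:
    """Decide whether a proposed action may run; return (allowed, feedback)."""
    # Predict the state the action would produce, without touching the real one.
    next_state = dict(state)
    if action.kind == "input":
        next_state[action.target] = action.value

    # Assumed rule 1: no field may hold a value that contradicts the spec.
    for p in spec.predicates:
        if p.name in next_state and next_state[p.name] != p.expected:
            return False, f"{p.name} should be {p.expected!r}, got {next_state[p.name]!r}"

    # Assumed rule 2: a committing action requires every predicate to hold.
    if action.kind == "click" and action.target in spec.critical_actions:
        missing = [p.name for p in spec.predicates
                   if next_state.get(p.name) != p.expected]
        if missing:
            return False, f"cannot {action.target} yet; unsatisfied: {', '.join(missing)}"

    return True, "ok"


# Hand-written spec for the instruction "Send 20 dollars to Alice".
spec = Spec(
    predicates=(Predicate("recipient", "Alice"), Predicate("amount", "20")),
    critical_actions=frozenset({"send"}),
)
state: dict[str, str] = {}

r1 = verify(spec, state, Action("input", "recipient", "Alice"))  # allowed
state["recipient"] = "Alice"
r2 = verify(spec, state, Action("click", "send"))                # blocked: amount not set
r3 = verify(spec, state, Action("input", "amount", "200"))       # blocked: wrong amount
```

In this sketch the feedback string plays the role of corrective feedback that could be returned to the agent, while a `False` result corresponds to halting the action. Because the check is a deterministic function of the specification and the observed state, the same inputs always yield the same verdict, unlike asking an LLM to judge the action.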

Problem

Research questions and friction points this paper is trying to address.

Ensures mobile GUI agent actions align with user intent

Translates natural language instructions into verifiable specifications

Improves accuracy and safety of LFM-based mobile automation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Logic-based action verification for GUI agents

Autoformalization of natural language instructions

Runtime rule-based verification system

🔎 Similar Papers

Systematic Categorization, Construction and Evaluation of New Attacks against Multi-modal Mobile GUI Agents