🤖 AI Summary
To address the poor generalization and cross-application adaptability of GUI interaction on mobile devices, this work proposes a multimodal large language model (MLLM)-based agent framework. Methodologically: (1) a flexible action space is designed to enable dynamic adaptation across heterogeneous applications; (2) a two-stage paradigm is introduced, in which the exploration stage combines human-in-the-loop and autonomous exploration to construct an updatable, structured UI semantic knowledge base, while the deployment stage leverages retrieval-augmented generation (RAG) for precise task execution; (3) the framework tightly integrates MLLMs, GUI element parsing, vision–language joint representation learning, and dynamic knowledge-base updating. Experiments demonstrate that this approach significantly outperforms state-of-the-art methods across multiple mobile GUI benchmarks. Notably, it achieves high accuracy and strong robustness in complex cross-app workflows (e.g., ticket booking → payment → screenshot), highlighting its practical viability for real-world mobile automation.
📝 Abstract
With the advancement of Multimodal Large Language Models (MLLMs), LLM-driven visual agents are increasingly impacting software interfaces, particularly those with graphical user interfaces (GUIs). This work introduces a novel MLLM-based multimodal agent framework for mobile devices that navigates applications by emulating human-like interactions. Our agent constructs a flexible action space that enhances adaptability across various applications, drawing on parsed UI elements, text, and vision descriptions. The agent operates in two main phases: exploration and deployment. During the exploration phase, the functionalities of user interface elements are documented, through either agent-driven or manual exploration, in a customized structured knowledge base. In the deployment phase, retrieval-augmented generation (RAG) enables efficient retrieval from, and updating of, this knowledge base, empowering the agent to perform tasks effectively and accurately, including complex, multi-step operations across applications. This demonstrates the framework's adaptability and precision in handling customized task workflows. Experimental results across various benchmarks confirm the framework's superior performance and its effectiveness in real-world scenarios. Our code will be open-sourced soon.
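The two-phase paradigm above can be sketched in miniature. The class and method names below (`UIKnowledgeBase`, `record`, `retrieve`) are hypothetical, and a token-overlap score stands in for the embedding-based retrieval a real RAG pipeline would use; in the actual framework, an MLLM would generate the element descriptions during exploration.

```python
# Minimal sketch of the exploration/deployment paradigm.
# Assumptions: names are illustrative; token overlap replaces
# the vector-similarity retrieval a production RAG system would use.

class UIKnowledgeBase:
    """Structured store mapping (app, element_id) -> functionality note."""

    def __init__(self):
        self.entries = {}  # (app, element_id) -> description

    # --- Exploration phase: document what each UI element does ---
    def record(self, app, element_id, description):
        self.entries[(app, element_id)] = description

    # --- Deployment phase: RAG-style retrieval of relevant elements ---
    def retrieve(self, query, top_k=1):
        q = set(query.lower().split())
        scored = [
            (len(q & set(desc.lower().split())), key, desc)
            for key, desc in self.entries.items()
        ]
        scored.sort(key=lambda t: t[0], reverse=True)
        return [(key, desc) for score, key, desc in scored[:top_k] if score > 0]


# Exploration: the agent (or a human) annotates elements it encounters.
kb = UIKnowledgeBase()
kb.record("ticket_app", "btn_book", "book a ticket for the selected event")
kb.record("ticket_app", "btn_pay", "open the payment page to pay for a booking")

# Deployment: retrieve the element most relevant to the current sub-task.
hits = kb.retrieve("pay for my ticket booking", top_k=1)
print(hits[0][0])  # -> ('ticket_app', 'btn_pay')
```

Because the store is updatable, new observations made during deployment can be written back with `record`, matching the dynamic knowledge-base updating described in the summary.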