AppAgent v2: Advanced Agent for Flexible Mobile Interactions

📅 2024-08-05
🏛️ arXiv.org
📈 Citations: 38
Influential: 3
📄 PDF
🤖 AI Summary
To address the poor generalization and weak cross-application adaptability of GUI interaction agents on mobile devices, this work proposes a multimodal large language model (MLLM)-based agent framework. Methodologically: (1) a flexible action space is designed to enable dynamic adaptation across heterogeneous applications; (2) a two-stage paradigm is introduced: the exploration stage combines human-in-the-loop and autonomous exploration to construct an updatable, structured UI semantic knowledge base, while the deployment stage leverages retrieval-augmented generation (RAG) over that knowledge base for precise task execution; (3) the framework tightly integrates MLLMs, GUI element parsing, vision-language joint representation learning, and dynamic knowledge-base updating. Experiments show that the approach outperforms state-of-the-art methods across multiple mobile GUI benchmarks. Notably, it maintains high accuracy and strong robustness in complex cross-app workflows (e.g., ticket booking → payment → screenshot), highlighting its practical viability for real-world mobile automation.
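The "flexible action space" can be read as a small set of parameterized interaction primitives defined over parsed UI elements, rather than fixed per-app macros. The sketch below illustrates that idea; the names (UIElement, Action, tap, input_text, swipe) are assumptions for illustration, not the paper's actual interface.

```python
# Minimal sketch of a flexible action space for a mobile GUI agent.
# All names are illustrative assumptions, not the paper's API.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class UIElement:
    """A screen element recovered from the UI parser or a vision description."""
    element_id: str                    # identifier from the UI tree / parser
    bbox: Tuple[int, int, int, int]    # (x1, y1, x2, y2) screen coordinates
    description: str                   # text- or vision-derived functional description

@dataclass
class Action:
    """One step the agent can emit; which fields matter depends on `kind`."""
    kind: str                          # "tap", "long_press", "swipe", "input_text", "back"
    target: Optional[UIElement] = None # element to act on, if any
    text: Optional[str] = None         # payload for input_text
    direction: Optional[str] = None    # for swipe: "up", "down", "left", "right"

def tap(element: UIElement) -> Action:
    return Action(kind="tap", target=element)

def input_text(element: UIElement, text: str) -> Action:
    return Action(kind="input_text", target=element, text=text)

def swipe(direction: str) -> Action:
    return Action(kind="swipe", direction=direction)

# Example: tap the "Confirm payment" button found on the current screen.
confirm = UIElement("btn_confirm_payment", (120, 1680, 960, 1780),
                    "Confirms the selected ticket and opens the payment page.")
next_action = tap(confirm)
```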

📝 Abstract
With the advancement of Multimodal Large Language Models (MLLM), LLM-driven visual agents are increasingly impacting software interfaces, particularly those with graphical user interfaces. This work introduces a novel LLM-based multimodal agent framework for mobile devices. The framework navigates mobile devices by emulating human-like interactions. Our agent constructs a flexible action space that enhances adaptability across various applications, incorporating parsing, text, and vision descriptions. The agent operates through two main phases: exploration and deployment. During the exploration phase, the functionalities of user interface elements are documented, through either agent-driven or manual exploration, into a customized structured knowledge base. In the deployment phase, RAG technology enables efficient retrieval from, and updating of, this knowledge base, empowering the agent to perform tasks effectively and accurately. This includes complex, multi-step operations across various applications, demonstrating the framework's adaptability and precision in handling customized task workflows. Our experimental results across various benchmarks demonstrate the framework's superior performance, confirming its effectiveness in real-world scenarios. Our code will be open-sourced soon.
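Concretely, the exploration phase can be pictured as writing per-element documentation into an updatable, app-scoped store that the deployment phase later queries. The following minimal sketch assumes a JSON-backed store and invented names (UIKnowledgeBase, document, lookup); it is illustrative only, not the paper's implementation.

```python
# Illustrative sketch of the exploration phase: after each trial interaction,
# the agent (or a human demonstrator) records what a UI element does into a
# structured, updatable knowledge base. All names here are assumptions.
import json
from pathlib import Path
from typing import Optional

class UIKnowledgeBase:
    """App-scoped store of UI element documentation, persisted as JSON."""

    def __init__(self, path: str = "ui_kb.json"):
        self.path = Path(path)
        self.entries = json.loads(self.path.read_text()) if self.path.exists() else {}

    def document(self, app: str, element_id: str, functionality: str) -> None:
        """Insert or overwrite the functional description of one element."""
        self.entries.setdefault(app, {})[element_id] = functionality
        self.path.write_text(json.dumps(self.entries, indent=2, ensure_ascii=False))

    def lookup(self, app: str, element_id: str) -> Optional[str]:
        return self.entries.get(app, {}).get(element_id)

# During autonomous exploration the description might come from asking the MLLM
# to compare before/after screenshots of an interaction; during manual
# exploration it is taken from a human demonstration.
kb = UIKnowledgeBase()
kb.document("tickets_app", "btn_confirm_payment",
            "Confirms the selected ticket and opens the payment page.")
```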
Problem

Research questions and friction points this paper is trying to address.

Existing GUI agents generalize poorly across heterogeneous mobile applications
Emulating human-like interaction on mobile devices without app-specific engineering is difficult
Complex, multi-step (and cross-app) task workflows remain hard to execute reliably
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based multimodal agent framework with a flexible action space
Two-phase exploration and deployment paradigm built around a structured UI knowledge base
RAG-based retrieval for grounded, accurate task execution (see the sketch after this list)
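How RAG fits the deployment phase: at each step the agent retrieves the element documentation most relevant to the current task and screen, and feeds it to the MLLM alongside the screenshot description. The sketch below uses a toy embedding and invented function names (embed, retrieve, build_prompt) purely to show the shape of that loop; it is not the paper's code.

```python
# Hedged sketch of deployment-time retrieval-augmented grounding.
from typing import Dict, List

def embed(text: str) -> List[float]:
    """Toy character-frequency embedding; a real system would use a sentence-embedding model."""
    vec = [0.0] * 128
    for ch in text.lower():
        vec[ord(ch) % 128] += 1.0
    return vec

def cosine(a: List[float], b: List[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def retrieve(task: str, element_docs: Dict[str, str], k: int = 5) -> List[str]:
    """Return the k element descriptions most relevant to the task."""
    q = embed(task)
    ranked = sorted(element_docs.items(),
                    key=lambda kv: cosine(q, embed(kv[1])), reverse=True)
    return [f"{eid}: {desc}" for eid, desc in ranked[:k]]

def build_prompt(task: str, screen_caption: str, retrieved: List[str]) -> str:
    """Compose the grounding context sent to the MLLM at each step."""
    return (f"Task: {task}\n"
            f"Current screen: {screen_caption}\n"
            "Known UI elements:\n" + "\n".join(retrieved) + "\n"
            "Decide the next action (tap / long_press / swipe / input_text / back).")

docs = {"btn_search": "Opens the ticket search form.",
        "btn_confirm_payment": "Confirms the selected ticket and opens the payment page."}
print(build_prompt("Book a train ticket and pay",
                   "Ticket details page with a confirm button",
                   retrieve("Book a train ticket and pay", docs, k=2)))
```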
👥 Authors
Yanda Li, University of Technology Sydney
Chi Zhang, Westlake University
Wanqi Yang, University of Technology Sydney
Bin Fu, Tencent
Pei Cheng, Tencent
Xin Chen, Tencent
Ling Chen, University of Technology Sydney
Yunchao Wei, Professor, Beijing Jiaotong University, UTS, UIUC, NUS (Computer Vision, Machine Learning)