AppCopilot: Toward General, Accurate, Long-Horizon, and Efficient Mobile Agent

📅 2025-09-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Mobile intelligent agents face four key challenges in practical deployment: poor generalization, low screen interaction accuracy, difficulty executing long-horizon tasks, and inefficient operation on resource-constrained devices. To address these, we propose the first full-stack, multimodal mobile agent system specifically designed for on-device execution. Our approach integrates a multi-agent collaborative architecture, hierarchical task decomposition, and chain-of-thought reasoning, augmented with on-device optimization, resource-aware scheduling, and lightweight multimodal large language models. The system enables cross-application and cross-device task orchestration, high-precision screen localization, bilingual (Chinese–English) interaction, and voice-driven operation. Experiments demonstrate significant improvements over state-of-the-art methods in generalizability, interaction accuracy, long-task completion rate, and on-device inference efficiency, achieving stable, low-latency, and low-power operation even on entry-level smartphones.

📝 Abstract
With the rapid evolution of large language models and multimodal foundation models, the mobile-agent landscape has proliferated without converging on the fundamental challenges. This paper identifies four core problems that must be solved for mobile agents to deliver practical, scalable impact: (1) generalization across tasks, modalities, apps, and devices; (2) accuracy, specifically precise on-screen interaction and click targeting; (3) long-horizon capability for sustained, multi-step goals; and (4) efficiency, specifically high-performance runtime on resource-constrained devices. We present AppCopilot, a multimodal, multi-agent, general-purpose on-device assistant that operates across applications and constitutes a full-stack, closed-loop system from data to deployment. AppCopilot operationalizes this position through an end-to-end autonomous pipeline spanning data collection, training, deployment, high-quality and efficient inference, and mobile application development. At the model layer, it integrates multimodal foundation models with robust Chinese-English support. At the reasoning and control layer, it combines chain-of-thought reasoning, hierarchical task planning and decomposition, and multi-agent collaboration. At the execution layer, it enables user personalization and experiential adaptation, voice interaction, function calling, cross-app and cross-device orchestration, and comprehensive mobile app support. The system design incorporates profiling-driven optimization for latency, memory, and energy across heterogeneous hardware. Empirically, AppCopilot achieves significant improvements along all four dimensions: stronger generalization, higher-precision on-screen actions, more reliable long-horizon task completion, and faster, more resource-efficient runtime.
Problem

Research questions and friction points this paper is trying to address.

Achieving generalization across tasks, apps, and devices
Ensuring accurate on-screen interaction and click targeting
Enabling long-horizon capability for multi-step goals
Delivering efficient, high-performance runtime on resource-constrained devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal foundation models with robust Chinese-English support
Hierarchical task planning and multi-agent collaboration
Profiling-driven optimization for resource efficiency
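To make "hierarchical task planning and multi-agent collaboration" more concrete, here is a minimal Python sketch of the general idea: a planner splits a user goal into per-app subtasks, and each subtask is routed to an agent registered for that app. This is an illustration under stated assumptions, not AppCopilot's implementation. The names `Subtask`, `Plan`, `decompose`, and `run`, the `app: instruction` goal syntax, and the `" then "` separator are all hypothetical; a real system would presumably use a model rather than string splitting to produce the plan.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Subtask:
    app: str          # hypothetical app identifier, e.g. "maps"
    instruction: str  # natural-language step to perform inside that app


@dataclass
class Plan:
    goal: str
    subtasks: List[Subtask] = field(default_factory=list)


def decompose(goal: str) -> Plan:
    """Toy planner (assumption): split the goal on ' then ' and read an
    'app: instruction' prefix from each chunk."""
    plan = Plan(goal=goal)
    for chunk in goal.split(" then "):
        app, _, instruction = chunk.partition(":")
        plan.subtasks.append(Subtask(app.strip(), instruction.strip()))
    return plan


def run(plan: Plan, agents: Dict[str, Callable[[str], str]]) -> List[str]:
    """Route each subtask to the agent registered for its app, in order."""
    results = []
    for sub in plan.subtasks:
        agent = agents.get(sub.app)
        if agent is None:
            # No agent for this app: record the gap instead of failing.
            results.append(f"[{sub.app}] no agent registered")
            continue
        results.append(agent(sub.instruction))
    return results


# Stand-in agents (hypothetical); real agents would act on the device screen.
agents = {
    "maps": lambda step: f"maps did: {step}",
    "chat": lambda step: f"chat did: {step}",
}

plan = decompose("maps: find a cafe then chat: send the address")
results = run(plan, agents)
```

Keeping the plan as plain data separates planning from execution, so a long-horizon goal can be inspected, retried, or resumed one subtask at a time.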
👥 Authors

Jingru Fan
School of Artificial Intelligence, Shanghai Jiao Tong University

Yufan Dang
Tsinghua University
Natural Language Processing, Machine Learning, Artificial Intelligence

Jingyao Wu
MIT-Novo Nordisk AI Postdoctoral Fellow, MIT Media Lab
Emotion recognition, affective computing, machine learning, speech processing, time series analysis

Huatao Li
School of Artificial Intelligence, Shanghai Jiao Tong University

Runde Yang
School of Artificial Intelligence, Shanghai Jiao Tong University

Xiyuan Yang
UIUC
Trustworthy Machine Learning

Yuheng Wang
School of Artificial Intelligence, Shanghai Jiao Tong University

Zhong Zhang
Tsinghua University
Large Language Models, LLM Agents, Natural Language Processing

Yaxi Lu
Department of Computer Science and Technology, Tsinghua University

Yankai Lin
Associate Professor (Tenure Track), Gaoling School of AI, Renmin University of China
Natural Language Processing, Large Language Models

Zhiyuan Liu
Department of Computer Science and Technology, Tsinghua University

Dahai Li
Modelbest Inc.

Chen Qian
School of Artificial Intelligence, Shanghai Jiao Tong University