AppCopilot: Toward General, Accurate, Long-Horizon, and Efficient Mobile Agent

📅 2025-09-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Mobile intelligent agents face four key challenges in practical deployment: poor generalization, low screen-interaction accuracy, difficulty executing long-horizon tasks, and inefficient operation on resource-constrained devices. To address these, we propose the first full-stack, multimodal mobile agent system specifically designed for on-device execution. Our approach integrates a multi-agent collaborative architecture, hierarchical task decomposition, and chain-of-thought reasoning, augmented with on-device optimization, resource-aware scheduling, and lightweight multimodal large language models. The system enables cross-application and cross-device task orchestration, high-precision screen localization, bilingual (Chinese-English) interaction, and voice-driven operation. Experiments demonstrate significant improvements over state-of-the-art methods in generalizability, interaction accuracy, long-task completion rate, and on-device inference efficiency, with stable, low-latency, and low-power operation even on entry-level smartphones.

📝 Abstract

With the rapid evolution of large language models and multimodal foundation models, the mobile-agent landscape has proliferated without converging on the fundamental challenges. This paper identifies four core problems that must be solved for mobile agents to deliver practical, scalable impact: (1) generalization across tasks, modalities, apps, and devices; (2) accuracy, specifically precise on-screen interaction and click targeting; (3) long-horizon capability for sustained, multi-step goals; and (4) efficiency, specifically high-performance runtime on resource-constrained devices. We present AppCopilot, a multimodal, multi-agent, general-purpose on-device assistant that operates across applications and constitutes a full-stack, closed-loop system from data to deployment. AppCopilot operationalizes this position through an end-to-end autonomous pipeline spanning data collection, training, deployment, high-quality and efficient inference, and mobile application development. At the model layer, it integrates multimodal foundation models with robust Chinese-English support. At the reasoning and control layer, it combines chain-of-thought reasoning, hierarchical task planning and decomposition, and multi-agent collaboration. At the execution layer, it enables user personalization and experiential adaptation, voice interaction, function calling, cross-app and cross-device orchestration, and comprehensive mobile app support. The system design incorporates profiling-driven optimization for latency, memory, and energy across heterogeneous hardware. Empirically, AppCopilot achieves significant improvements along all four dimensions: stronger generalization, higher-precision on-screen actions, more reliable long-horizon task completion, and faster, more resource-efficient runtime.
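To make the reasoning-and-control layer more concrete, here is a minimal Python sketch of hierarchical task decomposition with per-app executor agents. It is an illustration only and not AppCopilot's implementation: the `Step` type, the `plan` and `run` functions, the `"app: action then app: action"` goal syntax, and the executor callables are all hypothetical names made up for this example. A real planner would presumably be a multimodal model instead of a string split.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One sub-task assigned to a single app (hypothetical structure)."""
    app: str
    action: str
    done: bool = False


def plan(goal: str) -> List[Step]:
    """Toy planner: split 'app: action then app: action' into ordered steps."""
    steps = []
    for part in goal.split(" then "):
        app, _, action = part.partition(":")
        steps.append(Step(app=app.strip(), action=action.strip()))
    return steps


def run(goal: str, executors: Dict[str, Callable[[str], bool]]) -> List[str]:
    """Dispatch each planned step to the executor agent registered for its app.

    Stops at the first failed step so that later steps never run against a
    broken intermediate state.
    """
    log = []
    for step in plan(goal):
        executor = executors.get(step.app)
        if executor is None:
            log.append(f"{step.app}: no agent registered")
            break
        step.done = executor(step.action)
        status = "ok" if step.done else "failed"
        log.append(f"{step.app}: {step.action} -> {status}")
        if not step.done:
            break
    return log


if __name__ == "__main__":
    # Stand-in executors; real ones would drive the device UI.
    agents = {
        "maps": lambda action: True,
        "chat": lambda action: "send" in action,
    }
    for line in run("maps: find cafe then chat: send address", agents):
        print(line)
```

Separating the planner from the executors keeps a cross-app goal as an ordered list of single-app steps, which is one simple way to approach the long-horizon and cross-app orchestration described in the abstract.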

Problem

Research questions and friction points this paper is trying to address.

Achieving generalization across tasks, apps, and devices

Ensuring accurate on-screen interaction and click targeting

Enabling long-horizon capability for multi-step goals

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal foundation models with robust Chinese-English support

Hierarchical task planning and multi-agent collaboration

Profiling-driven optimization for resource efficiency
