🤖 AI Summary
Cloud-based large language models (LLMs) suffer from high inference latency and cost, whereas edge-finetuned models exhibit limited generalization and struggle with complex tasks. Method: This paper proposes an edge-cloud collaborative multi-agent framework comprising a cloud-based planning agent and edge-based execution/observation agents operating in a closed loop. The observation agent introduces a novel pre-understanding module that efficiently compresses screen images into semantic text; a history-augmented reflection mechanism enables dynamic re-planning; and an edge-cloud co-scheduling strategy optimizes task allocation. Contribution/Results: This work is the first to combine, on mobile devices, the reasoning capabilities of multimodal LLMs (MLLMs) with the efficiency of edge deployment. Evaluated on AndroidWorld, the framework maintains high task success rates while significantly reducing MLLM token consumption, thereby enhancing both the practicality and deployment efficiency of mobile automation.
📝 Abstract
Cloud-based mobile agents powered by (multimodal) large language models ((M)LLMs) offer strong reasoning abilities but suffer from high latency and cost. While fine-tuned (multimodal) small language models ((M)SLMs) enable edge deployment, they often lose general capabilities and struggle with complex tasks. To address this, we propose EcoAgent, an Edge-Cloud cOllaborative multi-agent framework for mobile automation. EcoAgent features a closed-loop collaboration among a cloud-based Planning Agent and two edge-based agents: the Execution Agent for action execution and the Observation Agent for verifying outcomes. The Observation Agent uses a Pre-Understanding Module to compress screen images into concise text, reducing token usage. When a step fails, the Planning Agent retrieves the screen history and replans via a Reflection Module. Experiments on AndroidWorld show that EcoAgent maintains high task success rates while significantly reducing MLLM token consumption, enabling efficient and practical mobile automation.
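The closed loop described in the abstract can be sketched in a few lines. This is a minimal illustrative mock-up, not the paper's implementation: the class names, methods, and control flow are assumptions drawn only from the abstract (cloud planner, edge executor, edge observer with a pre-understanding step, and history-augmented reflection on failure).

```python
# Hypothetical sketch of the EcoAgent-style closed loop described above.
# All names and logic are illustrative stand-ins, not the actual system.

class PlanningAgent:
    """Cloud side: produces a step plan; replans on failure via reflection."""
    def plan(self, task):
        return [f"step 1 for '{task}'", f"step 2 for '{task}'"]

    def reflect(self, task, history):
        # History-augmented reflection: replan using past screen summaries.
        return [f"revised step for '{task}' ({len(history)} observations seen)"]

class ExecutionAgent:
    """Edge side: performs one UI action per plan step (stubbed here)."""
    def execute(self, step):
        return f"executed: {step}"

class ObservationAgent:
    """Edge side: compresses the screen to short text before verification."""
    def pre_understand(self, screen_image):
        # Pre-Understanding Module stand-in: image -> concise semantic text,
        # so the cloud MLLM never receives raw pixels (saving tokens).
        return f"text summary of {screen_image}"

    def verify(self, step, screen_text):
        # Placeholder check; a real observer compares outcome vs. intent.
        return bool(screen_text)

def run_task(task, max_reflections=2):
    planner, executor, observer = PlanningAgent(), ExecutionAgent(), ObservationAgent()
    plan, history = planner.plan(task), []
    for _ in range(max_reflections + 1):
        success = True
        for step in plan:
            executor.execute(step)                                  # edge acts
            screen_text = observer.pre_understand("screenshot.png") # edge compresses
            history.append(screen_text)
            if not observer.verify(step, screen_text):              # edge checks
                success = False
                break
        if success:
            return True
        plan = planner.reflect(task, history)  # cloud replans from edge history
    return False
```

The design point the sketch highlights is that only compact text summaries (not screenshots) cross the edge-cloud boundary, and the cloud planner is invoked again only when edge-side verification fails.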