🤖 AI Summary
Mobile GUI agents face a fundamental trade-off between the limited capabilities of on-device small language models (≤4B) and the deployment infeasibility or prohibitive cost of larger models (≥7B). To address this, we propose a device-cloud collaborative lightweight agent framework. It employs Qwen2.5-VL-3B as the on-device foundation model and introduces a dynamic task allocation mechanism that defaults to on-device execution and escalates to cloud-based inference only when necessary. The framework combines a two-stage fine-tuning pipeline (SFT followed by GRPO) trained on synthetically generated multimodal GUI data, lightweight interaction-history modeling, and real-time complexity assessment to enable efficient long-horizon reasoning. Evaluated on benchmarks including AndroidLab, our approach matches or exceeds the performance of significantly larger models while reducing cloud invocation frequency by 62%, easing the longstanding three-way trade-off among capability, cost, and deployment feasibility in mobile multimodal interaction.
📝 Abstract
With the advancement of multimodal large language models (MLLMs), building GUI agent systems has become an increasingly promising direction, especially for mobile platforms, given their rich app ecosystems and intuitive touch interactions. Yet mobile GUI agents face a critical dilemma: truly on-device models (4B or smaller) lack sufficient performance, while capable models (7B and above) are either too large for mobile deployment or prohibitively costly (e.g., cloud-only closed-source MLLMs). To resolve this, we propose LightAgent, a mobile agentic foundation model solution that leverages device-cloud collaboration to combine the cost-efficiency of on-device models with the high capability of cloud models, while avoiding their respective drawbacks. Specifically, LightAgent enhances Qwen2.5-VL-3B via two-stage SFT-then-GRPO training on synthetic GUI data for strong decision-making, integrates an efficient long-reasoning mechanism to utilize historical interactions under tight resource budgets, and defaults to on-device execution, escalating challenging subtasks to the cloud only when real-time complexity assessment warrants it. Experiments on the online AndroidLab benchmark and diverse apps show that LightAgent matches or approaches much larger models while significantly reducing cloud costs.
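The escalation policy described above can be sketched as a simple routing loop. This is a minimal illustration only: the names, the complexity heuristic, and the threshold are assumptions for exposition, not the paper's actual scoring function.

```python
# Hypothetical sketch of LightAgent-style device-cloud routing.
# The complexity heuristic and threshold below are illustrative
# assumptions, not the paper's real-time assessment method.
from dataclasses import dataclass

CLOUD_THRESHOLD = 0.7  # assumed score above which a step escalates to the cloud

@dataclass
class Step:
    instruction: str
    history_len: int   # prior interactions in this episode
    ui_elements: int   # actionable widgets on the current screen

def complexity(step: Step) -> float:
    """Toy complexity score in [0, 1]: longer histories and denser
    screens are assumed harder for the on-device 3B model."""
    return min(1.0, 0.05 * step.history_len + 0.02 * step.ui_elements)

def route(step: Step) -> str:
    """Default to on-device execution; escalate only hard steps."""
    return "cloud" if complexity(step) > CLOUD_THRESHOLD else "device"

steps = [Step("open settings", 2, 10), Step("fill multi-page form", 20, 40)]
print([route(s) for s in steps])  # → ['device', 'cloud']
```

Because routing defaults to the device, cloud calls are incurred only on the minority of steps whose estimated complexity exceeds the threshold, which is what drives the reported reduction in cloud invocation frequency.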