🤖 AI Summary
Mobile GUI agents face a fundamental trade-off between the limited capabilities of on-device small language models (≤4B) and the deployment infeasibility or prohibitive cost of larger models (≥7B). To address this, we propose a device-cloud collaborative lightweight agent framework. It employs Qwen2.5-VL-3B as the on-device foundation model and introduces a dynamic task allocation mechanism that defaults to on-device execution and escalates to cloud-based inference only when necessary. The framework combines a two-stage fine-tuning pipeline (SFT followed by GRPO) trained on synthetically generated multimodal GUI data, lightweight interaction-history modeling, and real-time complexity assessment to enable efficient long-horizon reasoning. Evaluated on benchmarks including AndroidLab, our approach matches or exceeds the performance of significantly larger models while reducing cloud invocation frequency by 62%, easing the longstanding three-way trade-off among capability, cost, and deployment feasibility in mobile multimodal interaction.
📝 Abstract
With the advancement of multimodal large language models (MLLMs), building GUI agent systems has become an increasingly promising direction, especially for mobile platforms, given their rich app ecosystems and intuitive touch interactions. Yet mobile GUI agents face a critical dilemma: truly on-device models (4B or smaller) lack sufficient performance, while capable models (7B and above) are either too large for mobile deployment or prohibitively costly (e.g., cloud-only closed-source MLLMs). To resolve this, we propose LightAgent, a mobile agentic foundation model solution that leverages device-cloud collaboration to combine the cost-efficiency of on-device models with the high capability of cloud models, while avoiding their respective drawbacks. Specifically, LightAgent enhances Qwen2.5-VL-3B via two-stage SFT-then-GRPO training on synthetic GUI data for strong decision-making, integrates an efficient long-reasoning mechanism to utilize historical interactions under tight resource budgets, and defaults to on-device execution, escalating challenging subtasks to the cloud only when real-time complexity assessment warrants it. Experiments on the online AndroidLab benchmark and diverse apps show that LightAgent matches or approaches much larger models while significantly reducing cloud costs.
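The escalation policy described above can be sketched as a simple routing loop. This is a minimal illustration only: the names, the complexity heuristic, and the threshold are assumptions for exposition, not the paper's actual scoring function.

```python
# Hypothetical sketch of LightAgent-style device-cloud routing.
# The complexity heuristic and threshold below are illustrative
# assumptions, not the paper's real-time assessment method.
from dataclasses import dataclass

CLOUD_THRESHOLD = 0.7  # assumed score above which a step escalates to the cloud

@dataclass
class Step:
    instruction: str
    history_len: int   # prior interactions in this episode
    ui_elements: int   # actionable widgets on the current screen

def complexity(step: Step) -> float:
    """Toy complexity score in [0, 1]: longer histories and denser
    screens are assumed harder for the on-device 3B model."""
    return min(1.0, 0.05 * step.history_len + 0.02 * step.ui_elements)

def route(step: Step) -> str:
    """Default to on-device execution; escalate only hard steps."""
    return "cloud" if complexity(step) > CLOUD_THRESHOLD else "device"

steps = [Step("open settings", 2, 10), Step("fill multi-page form", 20, 40)]
print([route(s) for s in steps])  # → ['device', 'cloud']
```

Because routing defaults to the device, cloud calls are incurred only on the minority of steps whose estimated complexity exceeds the threshold, which is what drives the reported reduction in cloud invocation frequency.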