LightAgent: Mobile Agentic Foundation Models

📅 2025-10-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Mobile GUI agents face a fundamental trade-off between the limited capabilities of on-device small language models (≤4B) and the deployment infeasibility or prohibitive cost of larger models (≥7B). To address this, we propose a device-cloud collaborative lightweight agent framework. It employs Qwen2.5-VL-3B as the on-device foundation model and introduces a dynamic task allocation mechanism that defaults to on-device execution and upgrades to cloud-based inference only when necessary. The framework combines a two-stage fine-tuning pipeline (SFT followed by GRPO) trained on synthetically generated multimodal GUI data, lightweight interaction-history modeling, and real-time complexity assessment to enable efficient long-horizon reasoning. Evaluated on benchmarks including AndroidLab, our approach matches or exceeds the performance of significantly larger models while reducing cloud invocation frequency by 62%. This effectively resolves the longstanding tripartite trade-off among capability, cost, and deployment feasibility in mobile multimodal interaction.

📝 Abstract
With the advancement of multimodal large language models (MLLMs), building GUI agent systems has become an increasingly promising direction, especially for mobile platforms, given their rich app ecosystems and intuitive touch interactions. Yet mobile GUI agents face a critical dilemma: truly on-device models (4B or smaller) lack sufficient performance, while capable models (starting from 7B) are either too large for mobile deployment or prohibitively costly (e.g., cloud-only closed-source MLLMs). To resolve this, we propose LightAgent, a mobile agentic foundation model solution that leverages device-cloud collaboration to tap the cost-efficiency of on-device models and the high capability of cloud models, while avoiding their drawbacks. Specifically, LightAgent enhances Qwen2.5-VL-3B via two-stage SFT→GRPO training on synthetic GUI data for strong decision-making, integrates an efficient long-reasoning mechanism to utilize historical interactions under tight resources, and defaults to on-device execution, only escalating challenging subtasks to the cloud via real-time complexity assessment. Experiments on the online AndroidLab benchmark and diverse apps show LightAgent matches or nears larger models, with a significant reduction in cloud costs.
Problem

Research questions and friction points this paper is trying to address.

Mobile GUI agents struggle with balancing on-device performance and cloud costs
Small on-device models lack capability while capable models are too large
Need efficient device-cloud collaboration for mobile agentic foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Device-cloud collaboration for mobile GUI agents
Two-stage SFT→GRPO training on synthetic data
Real-time complexity assessment for cloud escalation
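The escalation idea above can be illustrated with a toy routing sketch. The paper does not specify how its complexity assessor works, so the features (`ui_elements`, `history_steps`), thresholds, and function names below are all hypothetical; this only shows the default-on-device, escalate-when-hard control flow:

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    instruction: str
    ui_elements: int    # interactive elements on the current screen (assumed feature)
    history_steps: int  # length of the interaction history so far (assumed feature)

# Hypothetical thresholds; LightAgent's actual assessment criterion is not public.
UI_THRESHOLD = 30
HISTORY_THRESHOLD = 12

def assess_complexity(task: Subtask) -> float:
    """Toy heuristic: dense screens and long histories suggest a harder subtask."""
    return task.ui_elements / UI_THRESHOLD + task.history_steps / HISTORY_THRESHOLD

def route(task: Subtask) -> str:
    """Default to the on-device model; escalate to the cloud only when needed."""
    return "cloud" if assess_complexity(task) > 1.0 else "on-device"
```

In this sketch, a simple tap on a sparse screen stays on-device, while a long-horizon task on a dense screen is escalated; the paper instead performs this assessment in real time during execution.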
Yangqin Jiang
University of Hong Kong
Data Mining

Chao Huang
University of Hong Kong