🤖 AI Summary
This work addresses two problems facing lightweight on-device vision-language GUI agents: performance bottlenecks caused by limited model capacity, and the overfitting, catastrophic forgetting, and policy rigidity that supervised fine-tuning (SFT) induces in small models. The paper proposes an SFT-free training paradigm. It presents what the authors describe as the first systematic integration of generalized knowledge distillation into the GUI agent domain, via Guided On-policy Distillation that combines oracle reference trajectories with a dynamic retrieval mechanism. It also introduces a Multi-solution Dual-level GRPO reinforcement learning framework that jointly aligns macro-level subtask planning with micro-level action execution, and an automated pipeline that generates GUI task trajectories with multi-solution annotations. Experiments show that the method achieves state-of-the-art performance among lightweight models and remains competitive with substantially larger models across all benchmarks, pushing 2B/3B-scale agents beyond the limits of conventional imitation learning.
📝 Abstract
Developing lightweight, on-device vision-language GUI agents is essential for efficient cross-platform automated interaction. However, current on-device agents are constrained by limited model capacity, and their performance still needs substantial improvement. Traditional Supervised Fine-Tuning (SFT) of small-scale models often leads to overfitting, catastrophic forgetting, and policy rigidity, and thus fails to address these challenges. In this work, we propose a novel SFT-free training paradigm that significantly enhances the performance of small-scale models. We first present the first systematic integration of generalized knowledge distillation into the GUI agent domain via Guided On-policy Distillation. By incorporating oracle reference trajectories together with a dynamic retrieval mechanism, our method reduces hallucinations and mitigates the cognitive misalignment inherent in multi-solution GUI tasks. Building on this foundation, we further introduce a Multi-solution Dual-level GRPO framework that jointly aligns macro-level subtask planning with micro-level execution matching, thereby improving exploration in long-horizon GUI agent scenarios. In addition, we construct an automated data generation pipeline to synthesize GUI task trajectories with rich multi-solution annotations. Extensive experiments show that our method achieves state-of-the-art performance among lightweight models while remaining competitive with substantially larger-scale models across all benchmarks. Ablation studies further demonstrate that structured on-policy distillation and multi-solution dual-level exploration unlock the capabilities of 2B/3B-scale agents, surpassing the performance limits of conventional imitation learning.
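As a rough illustration of the GRPO component, the sketch below shows the group-relative advantage computation that GRPO-style methods generally use: each rollout's reward is normalized against the mean and standard deviation of its sampling group. The abstract does not give the reward formula, so everything specific to the multi-solution and dual-level parts is an assumption made for illustration only: the helper names `dual_level_reward` and `multi_solution_score`, the `plan_weight` parameter, the linear blend of planning and execution scores, and the choice to score a rollout against its best-matching reference solution.

```python
from statistics import mean, pstdev


def dual_level_reward(plan_score, action_score, plan_weight=0.5):
    """Blend a macro-level planning score with a micro-level execution score.

    Hypothetical: the linear blend and plan_weight are illustrative
    assumptions, not the paper's formula. Both scores are expected in [0, 1].
    """
    return plan_weight * plan_score + (1.0 - plan_weight) * action_score


def multi_solution_score(scores_per_reference):
    """Score a rollout against several valid reference solutions.

    Hypothetical: taking the best match means a rollout is not penalized
    for following a different, equally valid solution path.
    """
    return max(scores_per_reference)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampling group (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three rollouts for one task; each tuple holds hypothetical
    # (plan scores, action scores) against two reference solutions.
    group = [
        ([0.9, 0.4], [0.8, 0.3]),
        ([0.2, 0.3], [0.1, 0.4]),
        ([0.5, 1.0], [0.6, 0.9]),
    ]
    rewards = [
        dual_level_reward(multi_solution_score(p), multi_solution_score(a))
        for p, a in group
    ]
    print(group_relative_advantages(rewards))
```

Rollouts that score above their group's mean receive positive advantages and those below receive negative ones, so no separate learned value model is needed to form the baseline.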