LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning

📅 2026-05-08

🤖 AI Summary
This work targets two problems in lightweight on-device vision-language GUI agents: performance bottlenecks caused by limited model capacity, and the overfitting, catastrophic forgetting, and policy rigidity that supervised fine-tuning (SFT) tends to induce in small models. The paper proposes an SFT-free training paradigm. It presents what the authors describe as the first systematic integration of generalized knowledge distillation into GUI agents, via Guided On-policy Distillation that combines oracle reference trajectories with a dynamic retrieval mechanism. On top of this, a Multi-solution Dual-level GRPO reinforcement learning framework jointly aligns macro-level subtask planning with micro-level action execution, and an automated pipeline synthesizes GUI task trajectories with multi-solution annotations. Experiments show state-of-the-art performance among lightweight models across all benchmarks while remaining competitive with substantially larger models, and ablations indicate that the approach lets 2B/3B-scale agents exceed what conventional imitation learning achieves.
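To make the distillation component more concrete, here is a minimal sketch of an on-policy distillation loss in the style of generalized knowledge distillation: the student samples its own sequence, and the loss is the per-token reverse KL divergence between the student's and the teacher's next-token distributions on that sequence. This is an illustration under assumptions, not the paper's implementation. The function names are hypothetical, the choice of reverse KL is an assumption (other divergences are possible), and the sketch simply assumes the teacher logits were produced with whatever reference guidance the method supplies.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(log_s, log_t))


def on_policy_distill_loss(student_seq_logits, teacher_seq_logits):
    """Mean per-token reverse KL over a sequence sampled from the student.

    Both arguments are lists of per-position logit lists, evaluated on the
    same student-generated tokens (hypothetical interface for illustration).
    """
    per_token = [
        reverse_kl(s, t)
        for s, t in zip(student_seq_logits, teacher_seq_logits)
    ]
    return sum(per_token) / len(per_token)
```

The loss is zero when the student and teacher agree at every position and grows as the student places probability mass where the teacher does not.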
📝 Abstract
Developing lightweight, on-device vision-language GUI agents is essential for efficient cross-platform automated interaction. However, current on-device agents are constrained by limited model capacity, and further performance improvements remain urgently needed. Traditional Supervised Fine-Tuning (SFT) for small-scale models often leads to overfitting, catastrophic forgetting, and policy rigidity, and thus fails to fully address these challenges. In this work, we propose a novel SFT-free training paradigm that significantly enhances the performance of small-scale models. We begin by presenting the first systematic integration of generalized knowledge distillation into the GUI agent domain via Guided On-policy Distillation. By incorporating oracle reference trajectories together with a dynamic retrieval mechanism, our method reduces hallucinations and mitigates the cognitive misalignment inherent in multi-solution GUI tasks. Building on this foundation, we further introduce a Multi-solution Dual-level GRPO framework that jointly aligns macro-level subtask planning with micro-level execution matching, thereby improving exploration in long-horizon GUI agent scenarios. In addition, we construct an automated data generation pipeline to synthesize GUI task trajectories with rich multi-solution annotations. Extensive experiments show that our method achieves state-of-the-art performance among lightweight models while remaining competitive with substantially larger-scale models across all benchmarks. Ablation studies further demonstrate that structured on-policy distillation and multi-solution dual-level exploration can fully unlock the capabilities of 2B/3B-scale agents, surpassing the performance limits of conventional imitation learning.
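The GRPO step described above can be illustrated with a small sketch. GRPO scores a group of rollouts for the same task and normalizes each reward against the group's mean and standard deviation to obtain advantages. The reward below is purely hypothetical: it assumes a rollout is scored against several reference solutions, that the macro level compares subtask plans and the micro level compares actions, and that the best-matching solution determines the reward. The paper's actual reward design, weights, and matching rules are not given in this abstract, so every name and parameter here is an assumption.

```python
import statistics


def _overlap(pred, ref):
    """Fraction of reference positions matched exactly by the prediction."""
    if not ref:
        return 0.0
    hits = sum(1 for p, r in zip(pred, ref) if p == r)
    return hits / len(ref)


def dual_level_reward(rollout, solutions, w_macro=0.5, w_micro=0.5):
    """Score a rollout against each reference solution and keep the best.

    `rollout` and each entry of `solutions` are dicts with "subtasks" and
    "actions" lists (hypothetical schema). Weights are illustrative only.
    """
    best = 0.0
    for sol in solutions:
        macro = _overlap(rollout["subtasks"], sol["subtasks"])
        micro = _overlap(rollout["actions"], sol["actions"])
        best = max(best, w_macro * macro + w_micro * micro)
    return best


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Taking the maximum over reference solutions means a rollout is not penalized for following a valid path that differs from any single annotated trajectory, which is one plausible reading of "multi-solution" in this setting.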
Problem

Research questions and friction points this paper is trying to address.

on-device GUI agents
model capacity limitation
supervised fine-tuning
catastrophic forgetting
policy rigidity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge Distillation
Reinforcement Learning
On-device GUI Agents
Multi-solution Exploration
GRPO