LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the performance bottleneck that limited model capacity imposes on lightweight, on-device vision-language GUI agents, together with the overfitting, catastrophic forgetting, and policy rigidity that supervised fine-tuning (SFT) induces in small models. The paper proposes an SFT-free training paradigm. It presents the first systematic application of generalized knowledge distillation to GUI agents through Guided On-policy Distillation, which combines oracle reference trajectories with a dynamic retrieval mechanism, and it adds an automated pipeline that generates GUI task trajectories with multi-solution annotations. A Multi-solution Dual-level GRPO reinforcement learning framework then jointly optimizes subtask planning and action execution. Experiments show that the method achieves state-of-the-art performance among lightweight models across all benchmarks while remaining competitive with substantially larger models, taking 2B/3B-scale agents beyond what conventional imitation learning reaches.
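To make the idea of on-policy distillation concrete, the following is a minimal sketch, not the paper's implementation. It assumes the student samples a sequence itself and the teacher scores the same positions, and it uses a per-token reverse KL divergence as the distillation signal; the choice of divergence and the function names `reverse_kl` and `on_policy_distill_loss` are assumptions made for illustration only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position (assumed divergence)."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))


def on_policy_distill_loss(student_steps, teacher_steps):
    """Mean per-token divergence over a sequence sampled by the student.

    Both arguments are lists of logit vectors, one per generated token.
    """
    if len(student_steps) != len(teacher_steps):
        raise ValueError("student and teacher must score the same positions")
    if not student_steps:
        return 0.0
    total = sum(reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps))
    return total / len(student_steps)
```

The loss is zero when the student already matches the teacher on its own samples and grows as the two distributions diverge, which is the property that distinguishes this setup from imitating fixed teacher outputs.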

📝 Abstract

Developing lightweight, on-device vision-language GUI agents is essential for efficient cross-platform automated interaction. However, current on-device agents are constrained by limited model capacity, and further performance improvements are still needed. Traditional Supervised Fine-Tuning (SFT) of small-scale models often leads to overfitting, catastrophic forgetting, and policy rigidity, and thus fails to fully address these challenges. In this work, we propose a novel SFT-free training paradigm that significantly enhances the performance of small-scale models. We begin with the first systematic integration of generalized knowledge distillation into the GUI agent domain, via Guided On-policy Distillation. By incorporating oracle reference trajectories together with a dynamic retrieval mechanism, our method reduces hallucinations and mitigates the cognitive misalignment inherent in multi-solution GUI tasks. Building on this foundation, we introduce a Multi-solution Dual-level GRPO framework that jointly aligns macro-level subtask planning with micro-level execution matching, thereby improving exploration in long-horizon GUI agent scenarios. In addition, we construct an automated data generation pipeline to synthesize GUI task trajectories with rich multi-solution annotations. Extensive experiments show that our method achieves state-of-the-art performance among lightweight models while remaining competitive with substantially larger-scale models across all benchmarks. Ablation studies further demonstrate that structured on-policy distillation and multi-solution dual-level exploration can fully unlock the capabilities of 2B/3B-scale agents, surpassing the performance limits of conventional imitation learning.
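The group-relative scoring that GRPO-style methods rely on can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's reward design: the weighted sum of a plan-level and an action-level score, the `plan_weight` parameter, taking the best score over several reference solutions, and all function names are hypothetical. Only the normalization of rewards within a sampled group reflects standard GRPO practice, and the population standard deviation is used here for simplicity.

```python
from statistics import mean, pstdev


def dual_level_reward(plan_score, action_score, plan_weight=0.5):
    """Blend a macro (subtask plan) score with a micro (action) score.

    The linear blend and its weight are assumptions for illustration.
    """
    return plan_weight * plan_score + (1.0 - plan_weight) * action_score


def multi_solution_reward(scores_per_reference, plan_weight=0.5):
    """Score a rollout against several valid solutions and keep the best.

    scores_per_reference: list of (plan_score, action_score) pairs,
    one pair per reference solution.
    """
    return max(
        dual_level_reward(p, a, plan_weight) for p, a in scores_per_reference
    )


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Taking the maximum over reference solutions means a rollout is not penalized for following a valid path that differs from one particular annotation, which is one plausible reading of why multi-solution annotations help exploration.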

Problem

Research questions and friction points this paper is trying to address.

on-device GUI agents

model capacity limitation

supervised fine-tuning

catastrophic forgetting

policy rigidity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge Distillation

Reinforcement Learning

On-device GUI Agents

Multi-solution Exploration

GRPO

🔎 Similar Papers

GUICourse: From General Vision Language Models to Versatile GUI Agents

2024-06-17 · arXiv.org · Citations: 35

GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

2024-06-12 · arXiv.org · Citations: 47