Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

📅 2025-09-30
🤖 AI Summary
Building lightweight, on-device GUI agents for mobile, web, and desktop platforms remains challenging. This paper presents Ferret-UI Lite, a compact, end-to-end GUI agent built on a small (3B-parameter) model. It combines chain-of-thought reasoning, visual tool-use, and reinforcement learning with designed rewards, and is trained on a mixture of real and synthetic GUI data. The reported results are competitive with other small-scale GUI agents: GUI grounding scores of 91.6% on ScreenSpot-V2, 53.3% on ScreenSpot-Pro, and 61.2% on OSWorld-G, and navigation success rates of 28.0% on AndroidWorld and 19.8% on OSWorld.

📝 Abstract
Developing autonomous agents that effectively interact with Graphical User Interfaces (GUIs) remains a challenging open problem, especially for small on-device models. In this paper, we present Ferret-UI Lite, a compact, end-to-end GUI agent that operates across diverse platforms, including mobile, web, and desktop. Using techniques optimized for developing small models, we build our 3B Ferret-UI Lite agent by curating a diverse GUI data mixture from real and synthetic sources, strengthening inference-time performance through chain-of-thought reasoning and visual tool-use, and applying reinforcement learning with designed rewards. Ferret-UI Lite performs competitively with other small-scale GUI agents. In GUI grounding, Ferret-UI Lite attains scores of $91.6\%$, $53.3\%$, and $61.2\%$ on the ScreenSpot-V2, ScreenSpot-Pro, and OSWorld-G benchmarks, respectively. For GUI navigation, Ferret-UI Lite achieves success rates of $28.0\%$ on AndroidWorld and $19.8\%$ on OSWorld. We share our methods and lessons learned from developing compact, on-device GUI agents.
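
To make the grounding numbers above concrete, here is a minimal sketch of how a grounding score of this kind is commonly computed: a predicted click point counts as correct when it falls inside the target element's bounding box. This is an illustration under that assumption, not the paper's evaluation code; the names `Box` and `grounding_accuracy` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a target UI element, in pixels (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # A point on the border is treated as inside (an assumption of this sketch).
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(
    predictions: Sequence[Tuple[float, float]],
    targets: Sequence[Box],
) -> float:
    """Fraction of predicted click points that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return hits / len(targets)


# Toy data: two of the three predicted points fall inside the target box.
button = Box(left=0, top=0, right=20, bottom=20)
accuracy = grounding_accuracy([(10, 10), (50, 50), (5, 5)], [button, button, button])
```

Under this scoring rule, a benchmark score such as 91.6% would mean that roughly 916 out of every 1,000 predicted points landed inside the annotated element.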

Problem

Research questions and friction points this paper is trying to address.

Developing small autonomous agents for GUI interaction
Optimizing on-device models for cross-platform GUI operations
Enhancing GUI grounding and navigation with compact agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

Compact end-to-end GUI agent for diverse platforms
Curated diverse GUI data from real and synthetic sources
Enhanced inference with chain-of-thought reasoning and visual tools