🤖 AI Summary
Building lightweight, on-device GUI agents for mobile, web, and desktop platforms remains challenging. This paper proposes a novel on-device GUI agent architecture specifically designed for small language models (3B parameters), integrating chain-of-thought reasoning, vision-based tool invocation, and a customized reinforcement learning reward mechanism, trained jointly on real and synthetic GUI data. Our key contribution is an efficient cross-device interface perception and interaction framework that simultaneously achieves model compactness and strong task generalization. Experiments demonstrate state-of-the-art performance among on-device methods: element localization accuracy reaches 91.6% on ScreenSpot-V2, 53.3% on ScreenSpot-Pro, and 61.2% on OSWorld-G; navigation success rates achieve 28.0% on AndroidWorld and 19.8% on OSWorld—substantially outperforming existing on-device baselines.
📝 Abstract
Developing autonomous agents that effectively interact with Graphical User Interfaces (GUIs) remains a challenging open problem, especially for small on-device models. In this paper, we present Ferret-UI Lite, a compact, end-to-end GUI agent that operates across diverse platforms, including mobile, web, and desktop. Using techniques optimized for developing small models, we build our 3B Ferret-UI Lite agent by curating a diverse GUI data mixture from real and synthetic sources, strengthening inference-time performance through chain-of-thought reasoning and visual tool use, and applying reinforcement learning with carefully designed rewards. Ferret-UI Lite achieves performance competitive with other small-scale GUI agents. In GUI grounding, Ferret-UI Lite attains scores of $91.6\%$, $53.3\%$, and $61.2\%$ on the ScreenSpot-V2, ScreenSpot-Pro, and OSWorld-G benchmarks, respectively. For GUI navigation, Ferret-UI Lite achieves success rates of $28.0\%$ on AndroidWorld and $19.8\%$ on OSWorld. We share our methods and the lessons learned from developing compact, on-device GUI agents.
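As context for the grounding scores above: GUI grounding benchmarks such as ScreenSpot typically count a prediction as correct when the model's predicted click point lands inside the target element's bounding box. A minimal sketch of that scoring rule (function names and sample data are illustrative, not taken from the paper or any benchmark's official evaluation code):

```python
def point_in_bbox(point, bbox):
    """Return True if (x, y) point lies inside bbox = (x1, y1, x2, y2)."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(predictions, targets):
    """Fraction of predicted click points landing inside their target boxes."""
    hits = sum(point_in_bbox(p, b) for p, b in zip(predictions, targets))
    return hits / len(targets)

# Illustrative example: two of three predicted points fall inside their boxes.
preds = [(50, 40), (200, 310), (10, 10)]
boxes = [(30, 20, 80, 60), (150, 300, 260, 340), (100, 100, 120, 120)]
print(round(grounding_accuracy(preds, boxes), 3))  # → 0.667
```

Navigation benchmarks like AndroidWorld and OSWorld instead report end-to-end task success rates, which is why those numbers are much lower than the single-step grounding scores.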