🤖 AI Summary
To address the limited generalization of multimodal large language models (MLLMs) in mobile GUI action prediction, this paper pioneers the integration of rule-based reinforcement learning (RL) into MLLM training. We propose a unified rule-based reward mechanism tailored to GUI actions, curate a small, high-quality training set of 136 challenging tasks, and design a data-efficient RL paradigm that eliminates the need for large-scale supervised fine-tuning. Our approach combines Group Relative Policy Optimization (GRPO) with the Qwen2.5-VL-3B base model and jointly models action types and screen coordinates. On the in-distribution (ID) AndroidControl benchmark, our method achieves absolute improvements of +15.0% in action-type accuracy and +10.3% in coordinate-grounding accuracy over the base model. On the out-of-distribution (OOD) ScreenSpot-Pro benchmark, it surpasses the base model by 6.0%, matching the performance of a 7B model trained on 76K supervised samples.
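GRPO, mentioned above, dispenses with a learned value baseline: for each prompt it samples a group of rollouts and normalizes each rollout's reward by the group's mean and standard deviation to obtain advantages. A minimal sketch of that group-relative normalization (the function name is illustrative; this is the standard GRPO advantage computation, not code from the paper):

```python
import statistics

def group_relative_advantages(rewards):
    """Compute GRPO-style advantages for one group of sampled responses.

    Each response's advantage is its reward, centered and scaled by the
    mean and (population) std of the rewards within the same group.
    """
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]
```

Because the baseline comes from sibling rollouts rather than a critic network, only responses that score above their own group's average receive positive advantage, which is what makes the method data-efficient to train.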
📝 Abstract
The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards. Building on this idea, we are the first to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for graphical user interface (GUI) action prediction tasks. To this end, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices. We also introduce a unified rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). Experimental results demonstrate that our proposed data-efficient model, UI-R1-3B, achieves substantial improvements on both in-domain (ID) and out-of-domain (OOD) tasks. Specifically, on the ID benchmark AndroidControl, action type accuracy improves by 15% and grounding accuracy by 10.3%, compared with the base model (i.e., Qwen2.5-VL-3B). On the OOD GUI grounding benchmark ScreenSpot-Pro, our model surpasses the base model by 6.0% and achieves competitive performance with larger models (e.g., OS-Atlas-7B), which are trained via supervised fine-tuning (SFT) on 76K samples. These results underscore the potential of rule-based reinforcement learning to advance GUI understanding and control, paving the way for future research in this domain.
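The unified rule-based action reward is only named, not specified, in the abstract. A minimal sketch of how such a reward could be composed for GUI actions, assuming three common components: a binary action-type match, a click-point-in-bounding-box check, and an output-format check. The function names, the `<think>`/`<answer>` template, and the exact additive composition are illustrative assumptions, not the paper's implementation:

```python
import re

def action_type_reward(pred_type, gt_type):
    """1.0 if the predicted action type (e.g. 'click', 'scroll') matches."""
    return 1.0 if pred_type == gt_type else 0.0

def coord_reward(pred_xy, gt_bbox):
    """1.0 if the predicted click point falls inside the ground-truth
    bounding box, given as (x1, y1, x2, y2)."""
    x, y = pred_xy
    x1, y1, x2, y2 = gt_bbox
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

def format_reward(response):
    """1.0 if the response follows a <think>...</think><answer>...</answer>
    template, encouraging explicit reasoning before the action."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), re.DOTALL) else 0.0

def unified_reward(response, pred_type, pred_xy, gt_type, gt_bbox):
    """Sum the rule-based components; coordinates are only scored for
    actions that carry a screen location."""
    r = action_type_reward(pred_type, gt_type) + format_reward(response)
    if gt_type == "click":
        r += coord_reward(pred_xy, gt_bbox)
    return r
```

Because every component is a deterministic rule over the model's output, no learned reward model is needed, which is what allows GRPO training to work from only 136 curated tasks.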