🤖 AI Summary
To address the limited generalization of multimodal large language models (MLLMs) in mobile GUI action prediction, this paper pioneers the integration of rule-based reinforcement learning (RL) into MLLM training. We propose a unified rule-based reward mechanism tailored to GUI actions, curate a small, high-quality training set of 136 challenging tasks, and design a data-efficient RL paradigm that eliminates the need for large-scale supervised fine-tuning. Our approach combines Group Relative Policy Optimization (GRPO) with the Qwen2.5-VL-3B base model and jointly models action types and screen coordinates. On the in-distribution (ID) AndroidControl benchmark, our method achieves absolute improvements of +15.0% in action-type accuracy and +10.3% in coordinate-grounding accuracy over the base model. On the out-of-distribution (OOD) ScreenSpot-Pro benchmark, it surpasses the base model by 6.0%, matching the performance of a 7B model trained on 76K supervised samples.
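GRPO, mentioned above, dispenses with a learned value baseline: for each prompt it samples a group of rollouts and normalizes each rollout's reward by the group's mean and standard deviation to obtain advantages. A minimal sketch of that group-relative normalization (the function name is illustrative; this is the standard GRPO advantage computation, not code from the paper):

```python
import statistics

def group_relative_advantages(rewards):
    """Compute GRPO-style advantages for one group of sampled responses.

    Each response's advantage is its reward, centered and scaled by the
    mean and (population) std of the rewards within the same group.
    """
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]
```

Because the baseline comes from sibling rollouts rather than a critic network, only responses that score above their own group's average receive positive advantage, which is what makes the method data-efficient to train.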
📝 Abstract
The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards. Building on this idea, we are the first to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for graphical user interface (GUI) action prediction tasks. To this end, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices. We also introduce a unified rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). Experimental results demonstrate that our proposed data-efficient model, UI-R1-3B, achieves substantial improvements on both in-domain (ID) and out-of-domain (OOD) tasks. Specifically, on the ID benchmark AndroidControl, action type accuracy improves by 15% and grounding accuracy by 10.3%, compared with the base model (i.e., Qwen2.5-VL-3B). On the OOD GUI grounding benchmark ScreenSpot-Pro, our model surpasses the base model by 6.0% and achieves competitive performance with larger models (e.g., OS-Atlas-7B), which are trained via supervised fine-tuning (SFT) on 76K samples. These results underscore the potential of rule-based reinforcement learning to advance GUI understanding and control, paving the way for future research in this domain.
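The unified rule-based action reward is only named, not specified, in the abstract. A minimal sketch of how such a reward could be composed for GUI actions, assuming three common components: a binary action-type match, a click-point-in-bounding-box check, and an output-format check. The function names, the `<think>`/`<answer>` template, and the exact additive composition are illustrative assumptions, not the paper's implementation:

```python
import re

def action_type_reward(pred_type, gt_type):
    """1.0 if the predicted action type (e.g. 'click', 'scroll') matches."""
    return 1.0 if pred_type == gt_type else 0.0

def coord_reward(pred_xy, gt_bbox):
    """1.0 if the predicted click point falls inside the ground-truth
    bounding box, given as (x1, y1, x2, y2)."""
    x, y = pred_xy
    x1, y1, x2, y2 = gt_bbox
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

def format_reward(response):
    """1.0 if the response follows a <think>...</think><answer>...</answer>
    template, encouraging explicit reasoning before the action."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), re.DOTALL) else 0.0

def unified_reward(response, pred_type, pred_xy, gt_type, gt_bbox):
    """Sum the rule-based components; coordinates are only scored for
    actions that carry a screen location."""
    r = action_type_reward(pred_type, gt_type) + format_reward(response)
    if gt_type == "click":
        r += coord_reward(pred_xy, gt_bbox)
    return r
```

Because every component is a deterministic rule over the model's output, no learned reward model is needed, which is what allows GRPO training to work from only 136 curated tasks.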