UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
GUI agents face two core challenges: the difficulty of verifying trajectory outcomes and the scarcity of scalable, high-quality training data. To address these, we propose UI-Genie, a self-improving framework. It introduces UI-Genie-RM, the first GUI-specialized interleaved image-text reward model, which unifies action-level and task-level reward modeling. We design rule-based validation, controllable trajectory corruption, and hard negative mining to construct UI-Genie-RM-517k (517k reward annotations) and UI-Genie-Agent-16k (16k synthetic trajectories), the first reward-specific dataset for GUI agents. Furthermore, we establish a closed loop of reward-guided exploration and outcome verification that iteratively improves both the agent and the reward model in dynamic environments. Through three generations of joint data-model co-evolution, UI-Genie achieves state-of-the-art performance across multiple GUI agent benchmarks, with substantial gains on complex tasks. All code and datasets are publicly released.

📝 Abstract
In this paper, we introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verification of trajectory outcomes is challenging, and high-quality training data are not scalable. These challenges are addressed by a reward model and a self-improving pipeline, respectively. The reward model, UI-Genie-RM, features an image-text interleaved architecture that efficiently processes historical context and unifies action-level and task-level rewards. To support the training of UI-Genie-RM, we develop deliberately designed data generation strategies including rule-based verification, controlled trajectory corruption, and hard negative mining. To address the second challenge, a self-improvement pipeline progressively expands solvable complex GUI tasks by enhancing both the agent and reward models through reward-guided exploration and outcome verification in dynamic environments. For training the models, we generate UI-Genie-RM-517k and UI-Genie-Agent-16k, establishing the first reward-specific dataset for GUI agents while demonstrating high-quality synthetic trajectory generation without manual annotation. Experimental results show that UI-Genie achieves state-of-the-art performance across multiple GUI agent benchmarks with three generations of data-model self-improvement. We open-source our complete framework implementation and generated datasets to facilitate further research at https://github.com/Euphoria16/UI-Genie.
Problem

Research questions and friction points this paper is trying to address.

Addresses the difficulty of verifying GUI agent trajectory outcomes
Tackles the lack of scalable, high-quality training data
Improves agent performance through a self-improving pipeline and a reward model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-improving framework with reward model
Image-text interleaved reward architecture
Reward-guided exploration in dynamic environments
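To make the self-improvement idea above concrete, here is a minimal, hypothetical sketch of a reward-guided exploration and verification loop. The `Agent` and `RewardModel` classes are toy stand-ins (not the paper's implementation): the reward model scores candidate actions during rollout and verifies the finished trajectory, and only verified trajectories enter the next round's training data.

```python
# Illustrative sketch (not the paper's implementation) of the
# reward-guided exploration/verification loop. Agent, RewardModel,
# and the string-based "environment" are hypothetical stand-ins.
import random

random.seed(0)

class Agent:
    """Toy policy: proposes candidate actions for a GUI state."""
    def propose(self, state, k=4):
        return [f"action_{i}" for i in range(k)]

class RewardModel:
    """Toy stand-in for a unified reward model: scores single
    actions (action-level) and whole trajectories (task-level)."""
    def action_reward(self, state, action):
        return random.random()              # action-level score in [0, 1]
    def task_reward(self, trajectory):
        return float(len(trajectory) >= 3)  # task-level outcome check

def explore(agent, rm, task, max_steps=5):
    """Roll out one trajectory, greedily taking the highest-reward action."""
    state, trajectory = task, []
    for _ in range(max_steps):
        candidates = agent.propose(state)
        best = max(candidates, key=lambda a: rm.action_reward(state, a))
        trajectory.append((state, best))
        state = f"{state}|{best}"           # toy state transition
    return trajectory

def self_improve_round(agent, rm, tasks):
    """Keep only trajectories the reward model verifies as successful."""
    dataset = []
    for task in tasks:
        traj = explore(agent, rm, task)
        if rm.task_reward(traj) >= 1.0:     # outcome verification
            dataset.append(traj)            # new training data, no human labels
    return dataset

data = self_improve_round(Agent(), RewardModel(), ["open_settings", "send_email"])
print(len(data))  # → 2 (both toy trajectories pass verification)
```

In the actual framework, the collected trajectories would be used to fine-tune the agent, and the improved agent would then explore harder tasks in the next generation, which is the data-model co-evolution the paper describes.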
👥 Authors

Han Xiao (CUHK MMLab)
Guozhi Wang (vivo AI Lab)
Yuxiang Chai (The Chinese University of Hong Kong)
Zimu Lu (The Chinese University of Hong Kong)
Weifeng Lin (The Chinese University of Hong Kong)
Hao He (CUHK MMLab)
Lue Fan (CUHK MMLab)
Liuyang Bian (vivo AI Lab)
Rui Hu (vivo AI Lab)
Liang Liu (vivo AI Lab)
Shuai Ren (vivo AI Lab)
Yafei Wen (vivo AI Lab)
Xiaoxin Chen (Coriell Institute for Medical Research)
Aojun Zhou (The Chinese University of Hong Kong)
Hongsheng Li (CUHK MMLab)