UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning

📅 2025-09-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
GUI automation agents face dual challenges in multi-step tasks: offline RL suffers from the absence of trajectory-level rewards, while online RL encounters sparse rewards and high deployment costs. This paper proposes a semi-online reinforcement learning paradigm that simulates online interaction over offline trajectories, integrating long-horizon reward modeling with efficient training. Key contributions include: (1) a Patch Module that adaptively corrects trajectory distributional shifts; (2) the Semi-Online Performance (SOP) metric, which more accurately estimates true online performance; and (3) a joint optimization framework unifying trajectory replay, discounted future return computation, and step- and episode-level weighted advantage estimation. Evaluated on four dynamic benchmarks with a 7B-parameter model, the approach achieves state-of-the-art performance, with significant gains over the base model (AndroidWorld +12.0%, AITW +23.8%).
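The core mechanism is easiest to see as a replay loop. Below is a minimal sketch of a semi-online rollout, assuming the Patch Module substitutes the expert action whenever the policy's prediction diverges, so the episode can continue along the recorded trajectory. All identifiers (`ExpertStep`, `policy.generate`, `actions_match`) are hypothetical illustrations, not names from the released code.

```python
from dataclasses import dataclass

@dataclass
class ExpertStep:
    observation: str  # e.g., a serialized screenshot or UI tree
    action: str       # ground-truth GUI action recorded for this step

def semi_online_rollout(policy, trajectory, actions_match):
    """Replay an offline expert trajectory as if it were an online episode.

    The policy's own outputs are kept in the multi-turn dialogue history;
    on divergence, a patch step swaps in the expert action so the remaining
    offline steps stay a valid continuation (assumed behavior).
    """
    dialogue, rollout = [], []
    for step in trajectory:
        dialogue.append({"role": "user", "content": step.observation})
        predicted = policy.generate(dialogue)  # model output stays in context
        matched = actions_match(predicted, step.action)
        # Patch Module: recover from divergence using the expert action.
        executed = predicted if matched else step.action
        dialogue.append({"role": "assistant", "content": executed})
        rollout.append({"pred": predicted, "reward": 1.0 if matched else 0.0})
    return rollout
```

The per-step reward of 1 for a matched action is only a placeholder; the discounted-return shaping and weighted advantages are sketched after the abstract below.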

📝 Abstract
Graphical User Interface (GUI) agents have demonstrated remarkable progress in automating complex user interface interactions through reinforcement learning. However, current approaches face a fundamental dilemma: offline RL enables stable training on pre-collected trajectories, but struggles with multi-step task execution due to the lack of trajectory-level reward signals; online RL captures these signals through environment interaction, but suffers from sparse rewards and prohibitive deployment costs. To address this dilemma, we present Semi-online Reinforcement Learning, a novel paradigm that simulates online RL on offline trajectories. During each rollout process, we preserve the original model output within the multi-turn dialogue, where a Patch Module adaptively recovers the divergence between rollout and expert trajectories. To capture long-term training signals, Semi-online RL introduces discounted future returns into the reward computation and optimizes the policy with weighted step-level and episode-level advantages. We further introduce Semi-Online Performance (SOP), a metric that aligns better with true online performance, serving as a practical and effective proxy for real-world evaluation. Experiments show that our Semi-online RL achieves SOTA performance among 7B models across four dynamic benchmarks, with significant gains over the base model (e.g., +12.0% on AndroidWorld, +23.8% on AITW), demonstrating substantial progress in bridging the gap between offline training efficiency and online multi-turn reasoning. The code is available at https://github.com/X-PLUG/MobileAgent/tree/main/UI-S1.
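To make the reward computation concrete, here is a minimal sketch of discounted future returns combined with weighted step-level and episode-level advantages. The group-relative normalization, the equal-length-rollout assumption, and the mixing weights `w_step`/`w_episode` are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def discounted_returns(step_rewards, gamma=0.9):
    """R_t = sum_k gamma^k * r_{t+k}: each step's reward includes its future."""
    returns, running = np.zeros(len(step_rewards)), 0.0
    for t in reversed(range(len(step_rewards))):
        running = step_rewards[t] + gamma * running
        returns[t] = running
    return returns

def weighted_advantages(group_step_rewards, episode_rewards,
                        gamma=0.9, w_step=0.5, w_episode=0.5):
    """Combine step- and episode-level advantages over a group of rollouts.

    Normalizing returns across the group (GRPO-style baseline) and assuming
    equal-length rollouts are simplifications made for this sketch.
    """
    returns = np.array([discounted_returns(r, gamma)
                        for r in group_step_rewards])          # (G, T)
    a_step = (returns - returns.mean(0)) / (returns.std(0) + 1e-6)
    ep = np.asarray(episode_rewards, dtype=float)              # (G,)
    a_episode = (ep - ep.mean()) / (ep.std() + 1e-6)
    # Broadcast the episode-level advantage to every step of its rollout.
    return w_step * a_step + w_episode * a_episode[:, None]
```

The combined advantage then weights each step's policy-gradient term, letting episode-level outcomes shape per-step updates without live environment interaction.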
Problem

Research questions and friction points this paper is trying to address.

Addresses offline RL's inability to learn multi-step task execution from trajectory-level rewards
Overcomes online RL's sparse rewards and prohibitive deployment costs
Bridges the gap between offline training efficiency and online multi-turn reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-online Reinforcement Learning paradigm that simulates online RL on offline trajectories
Patch Module that adaptively recovers divergence between rollout and expert trajectories
Discounted future returns in reward computation with weighted step- and episode-level advantages
Semi-Online Performance (SOP) metric as a practical proxy for online evaluation (see the sketch after this list)
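One plausible reading of the SOP metric, reusing the `semi_online_rollout` sketch above: replay each task with patching and check whether the policy matched every expert step. The all-steps-matched success criterion and the partial-credit step score are assumptions for illustration, not the paper's exact definition.

```python
def semi_online_performance(policy, trajectories, actions_match):
    """Proxy for online success, computed purely on offline trajectories."""
    episode_success, step_scores = [], []
    for trajectory in trajectories:
        rollout = semi_online_rollout(policy, trajectory, actions_match)
        matches = [s["reward"] for s in rollout]
        # Partial credit: fraction of steps where the policy matched the expert.
        step_scores.append(sum(matches) / len(matches))
        # Episode counts as successful only with no unrecovered divergence.
        episode_success.append(float(all(m == 1.0 for m in matches)))
    return {
        "task_success_rate": sum(episode_success) / len(episode_success),
        "step_match_rate": sum(step_scores) / len(step_scores),
    }
```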
Authors

Zhengxi Lu (Zhejiang University)
Jiabo Ye (Tongyi Lab, Alibaba Group, mPLUG Team)
Fei Tang (Zhejiang University)
Yongliang Shen (Zhejiang University)
Haiyang Xu (Tongyi Lab, Alibaba Group)
Ziwei Zheng (Xi'an Jiaotong University)
Weiming Lu (Zhejiang University)
Ming Yan (Tongyi Lab, Alibaba Group)
Fei Huang (Tongyi Lab, Alibaba Group)
Jun Xiao (Zhejiang University)
Yueting Zhuang (Zhejiang University)