MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

📅 2026-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of high-fidelity, verifiable, and scalable simulation environments for mobile GUI agents. The authors propose MobileGym, a lightweight, browser-hosted simulation platform for everyday mobile applications that represents the full environment state as structured JSON, making interactions fully controllable and deterministically evaluable. A layered state model and a declarative task-definition framework keep state configuration and task creation practical at scale. The platform provides verifiable outcome signals and low-cost, highly concurrent simulation for everyday apps, two capabilities the authors describe as previously out of reach: a single server can host hundreds of parallel instances, each using about 400 MB of memory with a cold start of about 3 seconds. In a Sim-to-Real case study, GRPO training of Qwen3-VL-4B-Instruct yields a gain of 12.8 percentage points on the 256-task test set, and on a 59-task real-device subset, real-device execution retains 95.1% of the simulation-side training gain.
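As a rough sanity check on the "hundreds of parallel instances" figure, the sketch below estimates how many instances fit in a server's memory at about 400 MB each. The function name, the server sizes, and the 8 GB reserved for the operating system are illustrative assumptions, not values from the paper, and the estimate ignores CPU, GPU, and I/O limits.

```python
def memory_bound_instances(server_ram_gb, per_instance_mb=400, reserved_gb=8):
    """Upper bound on concurrent instances if memory were the only constraint.

    per_instance_mb follows the roughly 400 MB figure quoted in the summary;
    reserved_gb is an assumed allowance for the OS and other processes.
    """
    usable_mb = (server_ram_gb - reserved_gb) * 1024
    return max(0, int(usable_mb // per_instance_mb))


# Hypothetical server sizes, chosen only for illustration.
for ram_gb in (64, 128, 256):
    print(f"{ram_gb} GB RAM -> up to {memory_bound_instances(ram_gb)} instances")
```

Under these assumptions, a server with 128 GB or more of RAM lands in the range of several hundred instances, which is in line with the scale the summary describes.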
📝 Abstract
We present MobileGym, a browser-hosted, lightweight, fully controllable environment for everyday mobile use, targeting interaction fidelity without replicating proprietary backends. It enables two capabilities previously out of reach for everyday apps: verifiable outcome signals through deterministic state-based judging over structured JSON state, and scalable online RL through low-cost parallel rollouts. The full environment state is captured, configured, forked, and compared as structured JSON, and a single server can host hundreds of parallel instances, with about 400 MB memory per instance and about 3 s cold start. A layered state model and a declarative task-definition framework keep state programmability and task creation practical at scale, and a single programmatic judging mechanism delivers both deterministic evaluation verdicts and dense RL rewards. The accompanying MobileGym-Bench provides 416 parameterized task templates, including 256 test and 160 train templates, over 28 apps, with deterministic judges and a structured AnswerSheet protocol that avoids free-text matching failures. In a Sim-to-Real case study, GRPO on Qwen3-VL-4B-Instruct gains +12.8 percentage points on the 256-task test set, and on a 59-task real-device signal subset, real-device execution retains 95.1% of the simulation-side training gain. Project page: https://mobilegym.github.io.
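The idea of one programmatic judge producing both a deterministic verdict and a dense reward over structured JSON state can be illustrated with a small sketch. Everything here is hypothetical: the state schema, the `fork`, `canonical`, and `judge` helpers, and the example task are assumptions made for illustration and are not MobileGym's actual API or data format.

```python
import copy
import json

# Hypothetical environment state; the real MobileGym schema is not given in the abstract.
base_state = {
    "apps": {
        "clock": {"alarms": []},
        "notes": {"items": [{"title": "groceries", "body": "milk"}]},
    }
}


def fork(state):
    """Return an independent copy so parallel rollouts cannot affect each other."""
    return copy.deepcopy(state)


def canonical(state):
    """Serialize with sorted keys so equal states produce equal strings."""
    return json.dumps(state, sort_keys=True)


def judge(state, checks):
    """Run every check on the final state.

    Returns (verdict, reward): verdict is True only if all checks pass,
    and reward is the fraction of checks that pass (a dense signal).
    """
    results = [bool(check(state)) for check in checks]
    return all(results), sum(results) / len(results)


# Hypothetical task: "set an enabled alarm for 07:30 and leave the notes untouched".
checks = [
    lambda s: any(a["time"] == "07:30" for a in s["apps"]["clock"]["alarms"]),
    lambda s: any(a.get("enabled") for a in s["apps"]["clock"]["alarms"]),
    lambda s: canonical(s["apps"]["notes"]) == canonical(base_state["apps"]["notes"]),
]

# Simulate a successful rollout by editing a forked copy of the state.
rollout = fork(base_state)
rollout["apps"]["clock"]["alarms"].append({"time": "07:30", "enabled": True})

verdict, reward = judge(rollout, checks)
# A rollout that did nothing passes only the "notes untouched" check.
partial_verdict, partial_reward = judge(fork(base_state), checks)
```

Because the judge reads only the final JSON state, the same inputs always give the same verdict, and the fraction of satisfied checks gives partial credit that an RL algorithm such as GRPO could use as a reward.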
Problem

Research questions and friction points this paper is trying to address.

mobile GUI agent
verifiable simulation
parallel reinforcement learning
deterministic evaluation
sim-to-real transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

verifiable simulation
parallel RL
structured JSON state
declarative task definition
Sim-to-Real transfer
Dingbang Wu
Institute of Automation, Chinese Academy of Sciences
Rui Hao
Institute of Automation, Chinese Academy of Sciences
Haiyang Wang
Peking University
Agent, Foundation Model, Coding
Shuzhe Wu
Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Machine Learning
Han Xiao
MMLab CUHK
Computer Vision, Machine Learning
Zhenghong Li
Institute of Automation, Chinese Academy of Sciences
Bojiang Zhou
Institute of Automation, Chinese Academy of Sciences
Zheng Ju
Institute of Automation, Chinese Academy of Sciences
Zichen Liu
Institute of Automation, Chinese Academy of Sciences
Lue Fan
Institute of Automation, Chinese Academy of Sciences
Zhaoxiang Zhang
Institute of Automation, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Biologically-inspired Learning