🤖 AI Summary
This work addresses the lack of high-fidelity, verifiable, and scalable simulation environments for mobile GUI agents. The authors propose MobileGym, a lightweight, browser-hosted mobile application simulation platform that represents the full environment state as structured JSON, making interactions fully controllable and deterministically evaluable. A layered state model and a declarative task-definition framework provide two capabilities previously out of reach for everyday mobile apps: verifiable outcome signals and low-cost, highly concurrent simulation, with a single server hosting hundreds of parallel instances (about 400 MB of memory and about a 3-second cold start each). In a Sim-to-Real case study, GRPO training of Qwen3-VL-4B-Instruct yields a 12.8-percentage-point improvement on the 256-task test set, and on a 59-task real-device subset, real-device execution retains 95.1% of the simulation-side training gain.
📝 Abstract
We present MobileGym, a browser-hosted, lightweight, fully controllable environment for everyday mobile use, targeting interaction fidelity without replicating proprietary backends. It enables two capabilities previously out of reach for everyday apps: verifiable outcome signals through deterministic state-based judging over structured JSON state, and scalable online RL through low-cost parallel rollouts. The full environment state is captured, configured, forked, and compared as structured JSON, and a single server can host hundreds of parallel instances, with about 400 MB memory per instance and about 3 s cold start. A layered state model and a declarative task-definition framework keep state programmability and task creation practical at scale, and a single programmatic judging mechanism delivers both deterministic evaluation verdicts and dense RL rewards. The accompanying MobileGym-Bench provides 416 parameterized task templates, including 256 test and 160 train templates, over 28 apps, with deterministic judges and a structured AnswerSheet protocol that avoids free-text matching failures. In a Sim-to-Real case study, GRPO on Qwen3-VL-4B-Instruct gains +12.8 percentage points on the 256-task test set, and on a 59-task real-device signal subset, real-device execution retains 95.1% of the simulation-side training gain. Project page: https://mobilegym.github.io.
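To make the idea of state-based judging concrete, the following is a minimal Python sketch, not MobileGym's actual API. It assumes the environment state is a plain JSON-like dictionary, and every name in it (`fork`, `same_state`, `get_path`, `judge`, and the `apps.clock.alarms` layout) is hypothetical and chosen only for illustration. It shows how a single set of programmatic checks over a forked state could yield both a deterministic pass/fail verdict and a dense reward, here taken as the fraction of checks satisfied.

```python
import copy
import json
from typing import Any, Callable

State = dict[str, Any]
Check = Callable[[State], bool]


def fork(state: State) -> State:
    """Return an independent copy so parallel rollouts never share mutations."""
    return copy.deepcopy(state)


def same_state(a: State, b: State) -> bool:
    """Compare two states via a canonical JSON serialization."""
    return json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)


def get_path(state: State, path: str) -> Any:
    """Look up a dotted path such as 'apps.clock.alarms'; None if missing."""
    node: Any = state
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def judge(state: State, checks: list[Check]) -> tuple[bool, float]:
    """Return (verdict, reward): all checks pass, and the fraction that pass."""
    if not checks:
        raise ValueError("a task needs at least one check")
    results = [bool(check(state)) for check in checks]
    return all(results), sum(results) / len(results)


# Hypothetical initial state and task: "set a 07:30 alarm and turn Wi-Fi off".
initial: State = {"apps": {"clock": {"alarms": []}, "settings": {"wifi": True}}}
checks: list[Check] = [
    lambda s: any(a.get("time") == "07:30"
                  for a in get_path(s, "apps.clock.alarms") or []),
    lambda s: get_path(s, "apps.settings.wifi") is False,
]

# Simulate an agent that completes only the first half of the task.
rollout = fork(initial)
rollout["apps"]["clock"]["alarms"].append({"time": "07:30", "enabled": True})

verdict, reward = judge(rollout, checks)
print(verdict, reward)
```

Because the judge reads only the final structured state, the same checks give the same result on every run, and the partial-credit reward falls out of the verdict logic without a separate scoring path. The equal weighting of checks is an assumption of this sketch; the abstract does not specify how dense rewards are computed.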