Generalization in Online Reinforcement Learning for Mobile Agents

📅 2026-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited zero-shot generalization capabilities of existing mobile agents on unseen task instances, templates, and applications, as well as the absence of standardized evaluation benchmarks and open-source reinforcement learning frameworks. The problem is formalized as a Contextual Markov Decision Process (CMDP), and the study introduces AndroidWorld-Generalization—the first three-tier benchmark for evaluating mobile agent generalization—alongside an online reinforcement learning framework based on Group Relative Policy Optimization (GRPO) with containerized asynchronous rollout collection. Integrated with a 7B-parameter Vision-Language Model (VLM), the proposed approach outperforms supervised fine-tuning baselines by 26.1% on unseen task instances, and achieves gains of 15.7% and 8.3% on unseen templates and applications, respectively. Further performance improvements are observed through few-shot test-time adaptation to novel applications.
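The core of the training recipe is GRPO's critic-free baseline: for each task instance, a group of rollouts is sampled and each rollout's reward is normalized against the group's own mean and standard deviation. A minimal sketch of that advantage computation, assuming binary task-success rewards (function names are illustrative, not from the paper's released code):

```python
# Hedged sketch of GRPO's group-relative advantage, assuming binary
# task-success rewards as in online mobile-agent RL. Names are
# illustrative, not from the paper's open-source system.
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its group's statistics.

    GRPO samples a group of rollouts for the same task instance and
    uses the group mean/std as the baseline, removing the need for a
    learned value critic.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four rollouts of one task, two succeed (reward 1), two fail.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Successful rollouts receive positive advantages and failed ones negative, and the advantages of a group sum to roughly zero, which is what makes the signal "relative" within each task instance.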

📝 Abstract
Graphical user interface (GUI)-based mobile agents automate digital tasks on mobile devices by interpreting natural-language instructions and interacting with the screen. While recent methods apply reinforcement learning (RL) to train vision-language model (VLM) agents in interactive environments with a primary focus on performance, generalization remains underexplored due to the lack of standardized benchmarks and open-source RL systems. In this work, we formalize the problem as a Contextual Markov Decision Process (CMDP) and introduce **AndroidWorld-Generalization**, a benchmark with three increasingly challenging regimes for evaluating zero-shot generalization to unseen task instances, templates, and applications. We further propose an RL training system that integrates Group Relative Policy Optimization (GRPO) with a scalable rollout collection system, consisting of containerized infrastructure, asynchronous execution, and error recovery to support reliable and efficient training. Experiments on AndroidWorld-Generalization show that RL enables a 7B-parameter VLM agent to surpass supervised fine-tuning baselines, yielding a 26.1% improvement on unseen instances but only limited gains on unseen templates (15.7%) and apps (8.3%), underscoring the challenges of generalization. As a preliminary step, we demonstrate that few-shot adaptation at test time improves performance on unseen apps, motivating future research in this direction. To support reproducibility and fair comparison, we open-source the full RL training system, including the environment, task suite, models, prompt configurations, and the underlying infrastructure (https://github.com/zihuanjiang/AndroidWorld-Generalization).
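The scalable rollout collection the abstract describes pairs containerized Android environments with asynchronous execution, so one slow or crashed episode does not stall the whole batch. A minimal sketch of that pattern using `asyncio`; `run_episode`, the environment count, and the trajectory format are all assumptions for illustration, not the paper's released infrastructure:

```python
# Hedged sketch of asynchronous rollout collection across containerized
# environments. All names and the trajectory format are illustrative.
import asyncio

async def run_episode(env_id, task):
    # Placeholder for driving one containerized emulator through a task;
    # here we just simulate episode latency and return a dummy trajectory.
    await asyncio.sleep(0.01)
    return {"env": env_id, "task": task, "reward": 1.0}

async def collect_rollouts(tasks, num_envs=4):
    # A semaphore caps concurrency at the number of containers, while
    # gather() lets fast episodes finish without waiting on slow ones.
    sem = asyncio.Semaphore(num_envs)

    async def worker(i, task):
        async with sem:
            return await run_episode(i % num_envs, task)

    return await asyncio.gather(*(worker(i, t) for i, t in enumerate(tasks)))

rollouts = asyncio.run(collect_rollouts([f"task-{i}" for i in range(8)]))
```

In a real system the placeholder episode would also handle emulator failures (the error-recovery component the abstract mentions), typically by restarting the container and re-queuing the task.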
Problem

Research questions and friction points this paper is trying to address.

generalization
online reinforcement learning
mobile agents
zero-shot generalization
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalization
Reinforcement Learning
Vision-Language Model
Contextual MDP
AndroidWorld-Generalization