Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions

📅 2026-04-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenges of high simulator latency and low sample efficiency in online reinforcement learning for Android agents, which stem largely from the conventional single-state single-action paradigm that underutilizes costly-to-acquire states. To overcome this limitation, the authors propose the Android Coach framework, which introduces a single-state multi-action training paradigm. By integrating a critic network, a process reward model, and a group advantage estimator, the framework substantially improves sample utilization and policy learning stability without incurring additional simulator overhead. Experimental results show that the proposed method achieves 7.5% and 8.3% higher success rates than UI-TARS-1.5-7B on the AndroidLab and AndroidWorld benchmarks, respectively, while attaining 1.4× higher training efficiency than the baseline at equivalent performance levels.
📝 Abstract
Online reinforcement learning (RL) serves as an effective method for enhancing the capabilities of Android agents. However, guiding agents to learn through online interaction is prohibitively expensive due to the high latency of emulators and the sample inefficiency of existing RL algorithms. We identify a fundamental limitation in current approaches: the Single State Single Action paradigm, which updates the policy with one-to-one state-action pairs from online one-way rollouts without fully exploring each costly emulator state. In this paper, we propose Android Coach, a novel framework that shifts the training paradigm to Single State Multiple Actions, allowing the agent to sample and utilize multiple actions for a single online state. We enable this without additional emulator overhead by learning a critic that estimates action values. To ensure the critic serves as a reliable coach, we integrate a process reward model and introduce a group-wise advantage estimator based on the averaged critic outputs. Extensive experiments demonstrate the effectiveness and efficiency of Android Coach: it achieves 7.5% and 8.3% success rate improvements on AndroidLab and AndroidWorld over UI-TARS-1.5-7B, and attains 1.4× higher training efficiency than the Single State Single Action methods PPO and GRPO at matched success rates.
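A minimal sketch of the group-wise advantage idea described in the abstract: score several candidate actions sampled at the same emulator state with a critic, then baseline each critic estimate against the group mean, so no extra emulator steps are needed. This is an illustrative reconstruction, not the paper's implementation; the function name and the toy critic values are assumptions.

```python
import torch

def group_advantages(critic_q: torch.Tensor) -> torch.Tensor:
    """Group-wise advantages for K candidate actions at ONE state.

    critic_q: shape [K], critic value estimates for K actions sampled
    from the policy at the same (costly) emulator state. The group mean
    serves as the baseline, analogous to GRPO-style group normalization
    but using critic outputs instead of full rollout returns.
    """
    baseline = critic_q.mean()
    return critic_q - baseline

# Toy example: the critic scores 4 candidate actions for one screen state.
q = torch.tensor([0.9, 0.2, 0.5, 0.4])
adv = group_advantages(q)
# Advantages are centered: above-average actions get positive advantage,
# below-average ones negative, and they sum to ~0 by construction.
```

Only the single emulator state is rendered once; the K actions are scored entirely by the learned critic, which is what avoids the additional simulator overhead the abstract mentions.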
Problem

Research questions and friction points this paper is trying to address.

online reinforcement learning
sample inefficiency
emulator latency
Android agents
Single State Single Action
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single State Multiple Actions
Android Coach
critic-based coaching
group-wise advantage estimator
online reinforcement learning
🔎 Similar Papers
No similar papers found.