Succeed or Learn Slowly: Sample Efficient Off-Policy Reinforcement Learning for Mobile App Control

📅 2025-09-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address low sample efficiency and unstable policy-gradient updates in reinforcement learning for multi-turn, sparse-reward tasks, such as mobile application control with foundation models, this paper proposes SoLS (Succeed or Learn Slowly), an off-policy RL algorithm. Methodologically, SoLS modifies the off-policy actor-critic update to apply direct policy updates for positive samples (successful behaviour) and conservative, regularised updates for negative samples, preventing model degradation during foundation-model fine-tuning. It is further augmented with Successful Transition Replay (STR), which prioritises learning from transitions in successful trajectories to improve sample efficiency. Evaluated on the AndroidWorld benchmark, SoLS outperforms existing prompt-engineering and RL methods by at least 17% (relative) while requiring substantially fewer computational resources, with 5-60x faster inference than GPT-4o-based methods. The result combines high sample efficiency with lightweight deployment for foundation-model-based mobile UI control.
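The Successful Transition Replay idea, prioritising transitions from successful trajectories when sampling training batches, could be sketched roughly as follows. This is a hypothetical illustration, not the paper's implementation; the class name, the `success_fraction` parameter, and the two-queue design are all assumptions.

```python
import random
from collections import deque

class SuccessfulTransitionReplay:
    """Illustrative replay buffer that over-samples transitions from
    successful trajectories (names and design are assumptions, not
    taken from the paper)."""

    def __init__(self, capacity=10_000, success_fraction=0.5, seed=0):
        self.success = deque(maxlen=capacity)  # transitions from successful episodes
        self.other = deque(maxlen=capacity)    # all remaining transitions
        self.success_fraction = success_fraction
        self.rng = random.Random(seed)

    def add_trajectory(self, transitions, succeeded):
        # Route the whole trajectory by its terminal outcome, matching
        # the sparse, episode-level success signal of app-control tasks.
        target = self.success if succeeded else self.other
        target.extend(transitions)

    def sample(self, batch_size):
        # Draw a fixed fraction from successful transitions when available,
        # filling the rest of the batch from ordinary transitions.
        n_succ = min(int(batch_size * self.success_fraction), len(self.success))
        n_other = min(batch_size - n_succ, len(self.other))
        batch = self.rng.sample(list(self.success), n_succ)
        batch += self.rng.sample(list(self.other), n_other)
        self.rng.shuffle(batch)
        return batch
```

One design choice worth noting: routing whole trajectories rather than single transitions fits the sparse-reward setting, where success is only known at episode end.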

📝 Abstract
Reinforcement learning (RL) using foundation models for policy approximations in multi-turn tasks remains challenging. We identify two main limitations related to sparse reward settings and policy gradient updates, based on which we formulate a key insight: updates from positive samples with high returns typically do not require policy regularisation, whereas updates from negative samples, reflecting undesirable behaviour, can harm model performance. This paper introduces Succeed or Learn Slowly (SoLS), a novel off-policy RL algorithm evaluated on mobile app control tasks. SoLS improves sample efficiency when fine-tuning foundation models for user interface navigation via a modified off-policy actor-critic approach, applying direct policy updates for positive samples and conservative, regularised updates for negative ones to prevent model degradation. We augment SoLS with Successful Transition Replay (STR), which prioritises learning from successful interactions, further improving sample efficiency. We evaluate SoLS on the AndroidWorld benchmark, where it significantly outperforms existing methods (at least 17% relative increase), including prompt-engineering and RL approaches, while requiring substantially fewer computational resources than GPT-4o-based methods with 5-60x faster inference.
Problem

Research questions and friction points this paper is trying to address.

Improving sample efficiency in off-policy reinforcement learning
Addressing policy degradation from negative training samples
Enhancing mobile app control through modified actor-critic approach
Innovation

Methods, ideas, or system contributions that make the work stand out.

Off-policy actor-critic with modified updates
Direct policy updates for positive samples
Conservative regularized updates for negatives
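The asymmetric update in these bullets can be illustrated as a per-transition loss: positive-advantage samples receive a plain policy-gradient term, while negative ones add a penalty that keeps the policy near a reference (e.g. the pre-update model). This is a sketch under stated assumptions; `beta` and the squared log-ratio penalty are simple stand-ins for the paper's actual regulariser.

```python
import math

def sols_loss(log_prob, ref_log_prob, advantage, beta=1.0):
    """Illustrative SoLS-style per-transition loss (a sketch, not the
    paper's exact objective). Positive samples: direct policy-gradient
    update. Negative samples: add a conservative penalty toward a
    reference policy so undesirable behaviour updates the model slowly."""
    pg = -advantage * log_prob  # standard policy-gradient surrogate
    if advantage > 0:
        return pg  # "succeed": direct, unregularised update
    # "learn slowly": penalise deviation from the reference policy
    # (squared log-ratio as a simple KL-style stand-in).
    reg = beta * (log_prob - ref_log_prob) ** 2
    return pg + reg

# Example: with advantage -1, moving away from the reference costs extra.
loss_pos = sols_loss(math.log(0.5), math.log(0.5), advantage=1.0)
loss_neg = sols_loss(math.log(0.5), math.log(0.25), advantage=-1.0)
```

In this sketch the regulariser only activates on negative samples, which matches the paper's insight that high-return updates need no regularisation while updates from undesirable behaviour can harm the model.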