Succeed or Learn Slowly: Sample Efficient Off-Policy Reinforcement Learning for Mobile App Control

📅 2025-09-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address low sample efficiency and unstable policy-gradient updates in reinforcement learning for multi-turn, sparse-reward tasks, such as mobile application control with foundation models, this paper proposes SoLS (Succeed or Learn Slowly), an off-policy RL algorithm. Methodologically, SoLS modifies the off-policy actor-critic update to apply direct policy updates for positive samples (successful behaviour) and conservative, regularised updates for negative samples, preventing model degradation during foundation-model fine-tuning. It is further augmented with Successful Transition Replay (STR), which prioritises learning from transitions in successful trajectories to improve sample efficiency. Evaluated on the AndroidWorld benchmark, SoLS outperforms existing prompt-engineering and RL methods by at least 17% (relative) while requiring substantially fewer computational resources, with 5-60x faster inference than GPT-4o-based methods. The result combines high sample efficiency with lightweight deployment for foundation-model-based mobile UI control.
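The Successful Transition Replay idea, prioritising transitions from successful trajectories when sampling training batches, could be sketched roughly as follows. This is a hypothetical illustration, not the paper's implementation; the class name, the `success_fraction` parameter, and the two-queue design are all assumptions.

```python
import random
from collections import deque

class SuccessfulTransitionReplay:
    """Illustrative replay buffer that over-samples transitions from
    successful trajectories (names and design are assumptions, not
    taken from the paper)."""

    def __init__(self, capacity=10_000, success_fraction=0.5, seed=0):
        self.success = deque(maxlen=capacity)  # transitions from successful episodes
        self.other = deque(maxlen=capacity)    # all remaining transitions
        self.success_fraction = success_fraction
        self.rng = random.Random(seed)

    def add_trajectory(self, transitions, succeeded):
        # Route the whole trajectory by its terminal outcome, matching
        # the sparse, episode-level success signal of app-control tasks.
        target = self.success if succeeded else self.other
        target.extend(transitions)

    def sample(self, batch_size):
        # Draw a fixed fraction from successful transitions when available,
        # filling the rest of the batch from ordinary transitions.
        n_succ = min(int(batch_size * self.success_fraction), len(self.success))
        n_other = min(batch_size - n_succ, len(self.other))
        batch = self.rng.sample(list(self.success), n_succ)
        batch += self.rng.sample(list(self.other), n_other)
        self.rng.shuffle(batch)
        return batch
```

One design choice worth noting: routing whole trajectories rather than single transitions fits the sparse-reward setting, where success is only known at episode end.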

📝 Abstract
Reinforcement learning (RL) using foundation models for policy approximations in multi-turn tasks remains challenging. We identify two main limitations related to sparse reward settings and policy gradient updates, based on which we formulate a key insight: updates from positive samples with high returns typically do not require policy regularisation, whereas updates from negative samples, reflecting undesirable behaviour, can harm model performance. This paper introduces Succeed or Learn Slowly (SoLS), a novel off-policy RL algorithm evaluated on mobile app control tasks. SoLS improves sample efficiency when fine-tuning foundation models for user interface navigation via a modified off-policy actor-critic approach, applying direct policy updates for positive samples and conservative, regularised updates for negative ones to prevent model degradation. We augment SoLS with Successful Transition Replay (STR), which prioritises learning from successful interactions, further improving sample efficiency. We evaluate SoLS on the AndroidWorld benchmark, where it significantly outperforms existing methods (at least 17% relative increase), including prompt-engineering and RL approaches, while requiring substantially fewer computational resources than GPT-4o-based methods with 5-60x faster inference.
Problem

Research questions and friction points this paper is trying to address.

Improving sample efficiency in off-policy reinforcement learning
Addressing policy degradation from negative training samples
Enhancing mobile app control through modified actor-critic approach
Innovation

Methods, ideas, or system contributions that make the work stand out.

Off-policy actor-critic with modified updates
Direct policy updates for positive samples
Conservative regularized updates for negatives
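The asymmetric update in these bullets can be illustrated as a per-transition loss: positive-advantage samples receive a plain policy-gradient term, while negative ones add a penalty that keeps the policy near a reference (e.g. the pre-update model). This is a sketch under stated assumptions; `beta` and the squared log-ratio penalty are simple stand-ins for the paper's actual regulariser.

```python
import math

def sols_loss(log_prob, ref_log_prob, advantage, beta=1.0):
    """Illustrative SoLS-style per-transition loss (a sketch, not the
    paper's exact objective). Positive samples: direct policy-gradient
    update. Negative samples: add a conservative penalty toward a
    reference policy so undesirable behaviour updates the model slowly."""
    pg = -advantage * log_prob  # standard policy-gradient surrogate
    if advantage > 0:
        return pg  # "succeed": direct, unregularised update
    # "learn slowly": penalise deviation from the reference policy
    # (squared log-ratio as a simple KL-style stand-in).
    reg = beta * (log_prob - ref_log_prob) ** 2
    return pg + reg

# Example: with advantage -1, moving away from the reference costs extra.
loss_pos = sols_loss(math.log(0.5), math.log(0.5), advantage=1.0)
loss_neg = sols_loss(math.log(0.5), math.log(0.25), advantage=-1.0)
```

In this sketch the regulariser only activates on negative samples, which matches the paper's insight that high-return updates need no regularisation while updates from undesirable behaviour can harm the model.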