Adaptive Milestone Reward for GUI Agents

📅 2026-02-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses sparse rewards and reward misalignment in reinforcement learning for long-horizon GUI tasks by proposing ADMIRE (Adaptive Milestone Reward). ADMIRE dynamically extracts verifiable milestones through trajectory clustering and applies an asymmetric credit assignment strategy that denoises successful trajectories while providing constructive guidance for failed ones. Combining dynamically verifiable milestones with asymmetric credit assignment mitigates the tension between reward sparsity and reward hacking. On the AndroidWorld benchmark, ADMIRE improves absolute success rate by over 10% across multiple base models, and it generalizes to heterogeneous environments, including web navigation and embodied tasks.

📝 Abstract

Reinforcement Learning (RL) has emerged as a mainstream paradigm for training Mobile GUI Agents, yet it struggles with the temporal credit assignment problem inherent in long-horizon tasks. A primary challenge lies in the trade-off between reward fidelity and density: outcome rewards offer high fidelity but suffer from signal sparsity, while process rewards provide dense supervision but remain prone to bias and reward hacking. To resolve this conflict, we propose the Adaptive Milestone Reward (ADMIRE) mechanism. ADMIRE constructs a verifiable, adaptive reward system by anchoring trajectories to milestones, which are dynamically distilled from successful explorations. Crucially, ADMIRE integrates an asymmetric credit assignment strategy that denoises successful trajectories and scaffolds failed trajectories. Extensive experiments demonstrate that ADMIRE consistently yields over 10% absolute improvement in success rate across different base models on AndroidWorld. Moreover, the method exhibits robust generalizability, achieving strong performance across diverse RL algorithms and heterogeneous environments such as web navigation and embodied tasks.
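The abstract does not give the reward formulas, so the following is only an illustrative toy, not the paper's implementation. It sketches one possible reading of asymmetric milestone-based credit assignment: a successful trajectory keeps only its outcome reward, while a failed trajectory receives partial credit for each verifiable milestone it reached in order. The names `State`, `Milestone`, `milestones_reached`, `shaped_rewards`, and the `bonus` value are all hypothetical, and the milestones are hand-written here rather than distilled from successful explorations.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical types: a GUI state snapshot and a verifiable milestone predicate.
State = Dict[str, str]
Milestone = Callable[[State], bool]


def milestones_reached(
    trajectory: Sequence[State], milestones: Sequence[Milestone]
) -> List[int]:
    """Return the step index at which each milestone is first satisfied.

    Milestones must be reached in order; matching stops at the first
    milestone that is never satisfied.
    """
    hits: List[int] = []
    start = 0
    for check in milestones:
        idx = next(
            (t for t in range(start, len(trajectory)) if check(trajectory[t])),
            None,
        )
        if idx is None:
            break
        hits.append(idx)
        start = idx + 1
    return hits


def shaped_rewards(
    trajectory: Sequence[State],
    milestones: Sequence[Milestone],
    success: bool,
    bonus: float = 0.2,  # assumed value, chosen only for illustration
) -> List[float]:
    """Assign per-step rewards asymmetrically (an assumed scheme)."""
    rewards = [0.0] * len(trajectory)
    if not trajectory:
        return rewards
    if success:
        # Successful trajectory: keep only the high-fidelity outcome reward.
        rewards[-1] = 1.0
    else:
        # Failed trajectory: partial credit at each milestone actually reached.
        for idx in milestones_reached(trajectory, milestones):
            rewards[idx] += bonus
    return rewards


# Hypothetical task: open Settings, then the Wi-Fi page, then finish.
demo_milestones: List[Milestone] = [
    lambda s: s["screen"] == "settings",
    lambda s: s["screen"] == "wifi",
    lambda s: s["screen"] == "done",
]
demo_failed = [
    {"screen": "home"},
    {"screen": "settings"},
    {"screen": "wifi"},
    {"screen": "home"},
]
failed_rewards = shaped_rewards(demo_failed, demo_milestones, success=False)
```

In this toy, the failed trajectory still earns a learning signal at the two milestones it reached, which is the sense in which dense but verifiable rewards can counter sparsity without rewarding arbitrary intermediate behavior.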

Problem

Research questions and friction points this paper is trying to address.

temporal credit assignment

reward sparsity

reward hacking

long-horizon tasks

GUI agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Milestone Reward

Temporal Credit Assignment

Reinforcement Learning