🤖 AI Summary
To address the trade-off between decision accuracy and real-time performance in mobile GUI task automation, this paper proposes V-Droid, an agent that uses a large language model (LLM) as an *action verifier* rather than an action generator. Its core innovation is the *verifier-driven paradigm*, which comprises: (1) a discretized action space, (2) a prefill-only workflow that speeds up verification, (3) pairwise progress preference training that strengthens the verifier's decisions, and (4) a human-agent joint annotation scheme for collecting training data efficiently at scale. V-Droid reports state-of-the-art task success rates of 59.5% on AndroidWorld, 38.3% on AndroidLab, and 49% on MobileAgentBench, surpassing existing agents on each benchmark. Its latency is 0.7 seconds per step, which pairs that reliability with near-real-time responsiveness.
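The summary names pairwise progress preference training but does not give the objective. As a rough illustration only, the sketch below uses a Bradley-Terry style logistic loss, a common choice for pairwise preference learning; whether V-Droid uses exactly this form is an assumption. The function name and its scalar-score interface are hypothetical.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Logistic pairwise loss: -log(sigmoid(score_preferred - score_rejected)).

    ASSUMPTION: the verifier emits one scalar score per candidate action, and
    the action that makes progress on the task should outscore the one that
    does not. This is a generic preference loss, not the paper's confirmed
    training objective.
    """
    margin = score_preferred - score_rejected
    # Numerically stable softplus(-margin) = log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss shrinks as the preferred action is scored further above the rejected one.
loss_when_correct = pairwise_preference_loss(2.0, 0.0)
loss_when_wrong = pairwise_preference_loss(0.0, 2.0)
```

Under this assumed loss, a pair the verifier already ranks correctly contributes little, while a mis-ranked pair contributes a loss that grows roughly linearly with the size of the error.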
📝 Abstract
We propose V-Droid, a mobile GUI task automation agent. Unlike previous mobile agents that use Large Language Models (LLMs) as generators to produce actions directly at each step, V-Droid employs LLMs as verifiers that evaluate candidate actions before a final decision is made. To realize this paradigm, we introduce a comprehensive framework for constructing verifier-driven mobile agents: discretized action space construction coupled with a prefilling-only workflow to accelerate verification, pair-wise progress preference training to significantly enhance the verifier's decision-making capabilities, and a scalable human-agent joint annotation scheme to collect the necessary data efficiently at scale. V-Droid sets a new state-of-the-art task success rate across several public mobile task automation benchmarks: 59.5% on AndroidWorld, 38.3% on AndroidLab, and 49% on MobileAgentBench, surpassing existing agents by 9.5%, 2.1%, and 9%, respectively. Furthermore, V-Droid achieves a low latency of 0.7 seconds per step, making it the first mobile agent capable of delivering near-real-time, effective decision-making.
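The verifier-driven decision step described in the abstract can be pictured as: enumerate a finite set of candidate actions, score each one, and execute the highest-scoring candidate. The sketch below is a minimal illustration under stated assumptions. The `Action` fields, the `select_action` helper, and the `toy_score` function are all hypothetical; in particular, `toy_score` is a word-overlap stand-in for the LLM verifier, whose actual prompt and scoring interface the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Action:
    """One entry in a discretized action space (hypothetical fields)."""
    kind: str    # e.g. "click", "scroll"
    target: str  # label of the UI element the action applies to


def select_action(
    task: str,
    candidates: Sequence[Action],
    score: Callable[[str, Action], float],
) -> Action:
    """Score every candidate with the verifier and return the best one.

    Each score depends only on (task, action), so a real system could
    evaluate the candidates independently of one another.
    """
    if not candidates:
        raise ValueError("the discretized action space is empty")
    # max() keeps the first candidate among ties.
    return max(candidates, key=lambda action: score(task, action))


def toy_score(task: str, action: Action) -> float:
    """Stand-in for the LLM verifier (ASSUMPTION): fraction of the target's
    words that also appear in the task description."""
    task_words = set(task.lower().split())
    target_words = action.target.lower().split()
    if not target_words:
        return 0.0
    return sum(word in task_words for word in target_words) / len(target_words)


candidates = [
    Action("click", "Camera"),
    Action("click", "Settings"),
    Action("scroll", "down"),
]
chosen = select_action("open the settings app", candidates, toy_score)
```

Because the verifier only has to judge each fixed candidate rather than write an action out token by token, the per-candidate cost is a single scoring call, which is consistent with the abstract's emphasis on a prefilling-only workflow.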