Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment

📅 2025-03-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the trade-off between decision accuracy and real-time performance in mobile GUI task automation, this paper proposes V-Droid, an agent that uses a large language model (LLM) as an *action verifier* rather than an action generator. Its *verifier-driven paradigm* comprises: (1) a discretized action space, (2) a prefill-only workflow that accelerates verification, (3) pairwise progress preference training that improves the verifier's decisions, and (4) a human-agent joint annotation scheme for collecting training data at scale. V-Droid achieves state-of-the-art task success rates of 59.5% on AndroidWorld, 38.3% on AndroidLab, and 49% on MobileAgentBench, surpassing existing agents by 9.5%, 2.1%, and 9%, respectively. Its latency is 0.7 seconds per step, which the authors describe as near-real-time.

📝 Abstract

We propose V-Droid, a mobile GUI task automation agent. Unlike previous mobile agents that utilize Large Language Models (LLMs) as generators to directly generate actions at each step, V-Droid employs LLMs as verifiers to evaluate candidate actions before making final decisions. To realize this novel paradigm, we introduce a comprehensive framework for constructing verifier-driven mobile agents: the discretized action space construction coupled with the prefilling-only workflow to accelerate the verification process, the pair-wise progress preference training to significantly enhance the verifier's decision-making capabilities, and the scalable human-agent joint annotation scheme to efficiently collect the necessary data at scale. V-Droid sets a new state-of-the-art task success rate across several public mobile task automation benchmarks: 59.5% on AndroidWorld, 38.3% on AndroidLab, and 49% on MobileAgentBench, surpassing existing agents by 9.5%, 2.1%, and 9%, respectively. Furthermore, V-Droid achieves an impressively low latency of 0.7 seconds per step, making it the first mobile agent capable of delivering near-real-time, effective decision-making capabilities.
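The verifier-driven decision step described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical `Action` schema and a toy word-overlap scorer standing in for the LLM verifier; neither comes from the paper, and the real system scores candidates with a trained model rather than a heuristic.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Action:
    """One entry in a discretized action space (hypothetical schema)."""
    kind: str          # e.g. "click", "type", "navigate_back"
    target: str = ""   # UI element description, empty for global actions


def select_action(
    task: str,
    candidates: Sequence[Action],
    verifier: Callable[[str, Action], float],
) -> Action:
    """Score every candidate with the verifier and return the highest-scoring one."""
    if not candidates:
        raise ValueError("no candidate actions to verify")
    return max(candidates, key=lambda a: verifier(task, a))


def toy_verifier(task: str, action: Action) -> float:
    """Assumed stand-in for an LLM verifier: counts words shared by task and target."""
    task_words = set(task.lower().split())
    return float(len(task_words & set(action.target.lower().split())))


candidates = [
    Action("click", "settings button"),
    Action("click", "wifi toggle"),
    Action("navigate_back"),
]
best = select_action("turn on wifi", candidates, toy_verifier)
```

Because each candidate is scored independently, the scoring calls could in principle be batched, which is one plausible reason a verifier that only reads its prompt can run faster than a generator that decodes an action token by token.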

Problem

Research questions and friction points this paper is trying to address.

Improves mobile GUI task automation using a verifier-driven approach

Enhances decision-making with LLMs as verifiers, not generators

Achieves high task success rates and low latency

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs as verifiers for action evaluation

Discretized action space with a prefilling-only workflow

Pair-wise progress preference training to strengthen the verifier's decisions
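To make the idea of pair-wise preference training concrete, here is a small sketch of a Bradley-Terry style loss over two verifier scores. This summary does not state the paper's actual training objective, so the loss form and the function name `pairwise_preference_loss` are assumptions chosen for illustration only.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_preferred - score_rejected)).

    The loss shrinks as the action that makes progress is scored further
    above the action that does not. Assumed objective, not the paper's.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Equal scores give log(2); a correctly ordered pair gives a smaller loss
# than the same pair ordered the wrong way round.
tie_loss = pairwise_preference_loss(0.0, 0.0)
good_loss = pairwise_preference_loss(2.0, 0.0)
bad_loss = pairwise_preference_loss(0.0, 2.0)
```

Training on pairs rather than absolute labels only requires knowing which of two actions advances the task further, which is typically an easier judgment to annotate than an absolute quality score.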

🔎 Similar Papers

AppAgent v2: Advanced Agent for Flexible Mobile Interactions