SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation

📅 2024-10-19
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
This study addresses the lack of fair, systematic evaluation benchmarks for multimodal large language model (MLLM)-driven smartphone agents. To this end, we introduce a real-device benchmark for smartphone agents that supports cross-lingual (Chinese/English) evaluation across common daily-life scenarios. We design an Android real-device interaction framework that enables plug-and-play integration of multiple agents, and we develop an automated evaluation pipeline covering seven metrics, including task success rate, action steps, and memory/CPU overhead, to jointly quantify functional correctness and execution efficiency. Extensive evaluation of more than ten state-of-the-art agents reveals four fundamental bottlenecks: biased UI understanding, imprecise action grounding, weak long-term memory, and high operational cost. Our contributions include a reproducible benchmark, an open-source toolchain, and concrete, actionable insights for building trustworthy and efficient smartphone agents.
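To make the plug-and-play idea concrete, the sketch below shows one way a harness could drive heterogeneous agents on a live Android device over ADB. This is purely illustrative: the `SmartphoneAgent` interface, the action schema, and `run_episode` are assumptions for this example, not SPA-Bench's actual API; only the `adb` commands themselves are standard.

```python
# Illustrative sketch of a plug-and-play agent loop over ADB.
# SmartphoneAgent, next_action, and run_episode are hypothetical names,
# not part of SPA-Bench's real interface.
import subprocess
from abc import ABC, abstractmethod

class SmartphoneAgent(ABC):
    """Interface a harness could require from each integrated agent."""

    @abstractmethod
    def next_action(self, screenshot: bytes, instruction: str) -> dict:
        """Map the current screen and task instruction to one UI action."""

def adb(*args: str) -> bytes:
    """Run an adb command against the connected device and return stdout."""
    return subprocess.run(["adb", *args], capture_output=True, check=True).stdout

def run_episode(agent: SmartphoneAgent, instruction: str, max_steps: int = 30) -> int:
    """Drive one agent on one task and return the number of steps taken."""
    for step in range(1, max_steps + 1):
        screenshot = adb("exec-out", "screencap", "-p")  # PNG bytes of the live screen
        action = agent.next_action(screenshot, instruction)
        if action["type"] == "done":    # agent signals task completion
            return step
        if action["type"] == "tap":     # tap at pixel coordinates
            adb("shell", "input", "tap", str(action["x"]), str(action["y"]))
        elif action["type"] == "text":  # type into the focused field
            adb("shell", "input", "text", action["text"])
    return max_steps
```

Because every agent only has to implement `next_action`, new agents can be swapped in without touching the device-interaction layer, which is the essence of a plug-and-play design.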

📝 Abstract
Smartphone agents are increasingly important for helping users control devices efficiently, with (Multimodal) Large Language Model (MLLM)-based approaches emerging as key contenders. Fairly comparing these agents is essential but challenging, requiring a varied task scope, the integration of agents with different implementations, and a generalisable evaluation pipeline to assess their strengths and weaknesses. In this paper, we present SPA-Bench, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents in an interactive environment that simulates real-world conditions. SPA-Bench offers three key contributions: (1) A diverse set of tasks covering system and third-party apps in both English and Chinese, focusing on features commonly used in daily routines; (2) A plug-and-play framework enabling real-time agent interaction with Android devices, integrating over ten agents with the flexibility to add more; (3) A novel evaluation pipeline that automatically assesses agent performance across multiple dimensions, encompassing seven metrics related to task completion and resource consumption. Our extensive experiments across tasks and agents reveal challenges like interpreting mobile user interfaces, action grounding, memory retention, and execution costs. We propose future research directions to ease these difficulties, moving closer to real-world smartphone agent applications.
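The abstract describes an evaluation pipeline spanning seven metrics on task completion and resource consumption. The sketch below shows how per-episode logs could be collapsed into benchmark-level numbers; the specific metric set here is an assumption for illustration and does not reproduce SPA-Bench's exact seven metrics.

```python
# Illustrative aggregation of completion- and cost-style metrics.
# The fields of EpisodeResult are assumed examples, not SPA-Bench's metric list.
from dataclasses import dataclass
from statistics import mean

@dataclass
class EpisodeResult:
    success: bool        # did the agent complete the task?
    steps: int           # number of UI actions taken
    wall_time_s: float   # end-to-end execution time
    peak_mem_mb: float   # peak memory of the agent process
    api_cost_usd: float  # (M)LLM API spend for the episode

def summarize(results: list[EpisodeResult]) -> dict[str, float]:
    """Collapse per-episode logs into benchmark-level averages."""
    return {
        "success_rate": mean(r.success for r in results),
        "avg_steps": mean(r.steps for r in results),
        "avg_time_s": mean(r.wall_time_s for r in results),
        "avg_peak_mem_mb": mean(r.peak_mem_mb for r in results),
        "avg_cost_usd": mean(r.api_cost_usd for r in results),
    }
```

Reporting completion quality and resource cost side by side is what lets a benchmark expose trade-offs such as agents that succeed often but at high step counts or API expense.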
Problem

Research questions and friction points this paper is trying to address.

Fairly and systematically evaluating smartphone agents
Comparing (M)LLM-based approaches with differing implementations
Assessing agent performance under realistic, interactive device conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diverse bilingual (English/Chinese) task suite
Plug-and-play framework for real-device agent integration
Automated multi-metric evaluation pipeline