AI Summary
Current mobile agents for smartphone automation often suffer from low accuracy due to ambiguous instructions and task complexity, compounded by a lack of systematic failure analysis. This work introduces DailyDroid, the first multi-difficulty automation benchmark tailored to everyday smartphone usage, encompassing 75 tasks across 25 Android applications. We conduct 300 evaluations of GPT-4o and o4-mini under both text-only and multimodal (screenshot + OCR) input conditions. Furthermore, we propose the first failure taxonomy that integrates UI accessibility, input modality, and model architecture. Our analysis reveals that multimodal inputs yield only marginal performance gains, highlighting critical bottlenecks and pointing toward key directions for improving large language model-driven agents in real-world mobile environments.
Abstract
With the rapid advancement of large language models (LLMs), mobile agents have emerged as promising tools for phone automation, simulating human interactions on screens to accomplish complex tasks. However, these agents often suffer from low accuracy, misinterpretation of user instructions, and failure on challenging tasks, and little prior work has examined why and where they fail. To address this gap, we introduce DailyDroid, a benchmark of 75 tasks in five scenarios across 25 Android apps, spanning three difficulty levels to mimic everyday smartphone use. We evaluate GPT-4o and o4-mini on the benchmark with text-only and multimodal (text + screenshot) inputs across 300 trials. The two input conditions perform comparably, with multimodal inputs yielding marginally higher success rates. Through in-depth failure analysis, we compile a handbook of common failures. Our findings reveal critical issues in UI accessibility, input modalities, and LLM/app design, offering implications for future mobile agents, applications, and UI development.