Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots

📅 2026-04-20

🤖 AI Summary
Current mobile agents for smartphone automation often suffer from low accuracy due to ambiguous instructions and task complexity, compounded by a lack of systematic failure analysis. This work introduces DailyDroid, the first multi-difficulty automation benchmark tailored to everyday smartphone usage, encompassing 75 tasks across 25 Android applications. We conduct 300 evaluations of GPT-4o and o4-mini under both text-only and multimodal (text + screenshot) input conditions. Furthermore, we propose the first failure taxonomy that integrates UI accessibility, input modality, and model architecture. Our analysis reveals that multimodal inputs yield only marginal performance gains, highlighting critical bottlenecks and pointing toward key directions for improving LLM-driven agents in real-world mobile environments.

📝 Abstract
With the rapid advancement of large language models (LLMs), mobile agents have emerged as promising tools for phone automation, simulating human interactions on screens to accomplish complex tasks. However, these agents often suffer from low accuracy, misinterpretation of user instructions, and failure on challenging tasks, with limited prior work examining why and where they fail. To address this, we introduce DailyDroid, a benchmark of 75 tasks in five scenarios across 25 Android apps, spanning three difficulty levels to mimic everyday smartphone use. We evaluate it using text-only and multimodal (text + screenshot) inputs on GPT-4o and o4-mini across 300 trials, revealing comparable performance with multimodal inputs yielding marginally higher success rates. Through in-depth failure analysis, we compile a handbook of common failures. Our findings reveal critical issues in UI accessibility, input modalities, and LLM/app design, offering implications for future mobile agents, applications, and UI development.
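
The 300 trials are consistent with running each of the 75 tasks once per model and input condition (75 × 2 × 2 = 300). The abstract does not spell out this breakdown, so the sketch below treats it as an assumption. The record fields (`task_id`, `model`, `modality`, `success`) and both function names are hypothetical, chosen only for illustration. They are not taken from the DailyDroid release.

```python
from collections import defaultdict
from itertools import product

# Figures stated in the abstract.
NUM_TASKS = 75
MODELS = ["GPT-4o", "o4-mini"]
MODALITIES = ["text-only", "text+screenshot"]


def build_trials(num_tasks=NUM_TASKS, models=MODELS, modalities=MODALITIES):
    """Enumerate one trial per (task, model, modality) combination.

    Assumption: every task is run exactly once per configuration.
    """
    return [
        {"task_id": task_id, "model": model, "modality": modality}
        for task_id, model, modality in product(
            range(1, num_tasks + 1), models, modalities
        )
    ]


def success_rate_by(results, key):
    """Group trial records by `key` and return the success fraction per group.

    Each record is expected to carry a boolean "success" field (hypothetical).
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for record in results:
        totals[record[key]] += 1
        wins[record[key]] += int(record["success"])
    return {group: wins[group] / totals[group] for group in totals}
```

Under these assumptions, `len(build_trials())` equals 300, and calling `success_rate_by(results, "modality")` on recorded outcomes would give the per-modality comparison that the abstract summarizes as "comparable performance".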
Problem

Research questions and friction points this paper is trying to address.

LLM-driven smartphone automation
failure analysis
UI accessibility
input modalities
mobile agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

mobile automation
LLM benchmark
multimodal input
failure analysis
DailyDroid