🤖 AI Summary
This study investigates the practical safety implications of AI agents performing high-risk or irreversible actions on mobile UIs, aiming to improve both operational safety and the interpretability of action impacts. We introduce the first expert-driven, fine-grained taxonomy for classifying the impacts of mobile UI operations. To support rigorous evaluation, we propose an impact-aware data annotation framework and a benchmark built on synthetic and real-world screen interaction traces, and we conduct systematic human annotation and LLM capability assessment under zero-shot and chain-of-thought (CoT) settings. Experimental results show that our taxonomy significantly improves LLMs' accuracy in impact reasoning (+18.7%), yet substantial challenges remain in recognizing complex, multi-step impacts (F1 < 0.42). Our work establishes the first reproducible, quantifiable evaluation paradigm for assessing the impact of AI agent actions on mobile UIs, providing a foundational resource for developing safer, more interpretable mobile AI agents.
📝 Abstract
With advances in generative AI, there is increasing work toward creating autonomous agents that can manage daily tasks by operating user interfaces (UIs). While prior research has studied the mechanics of how AI agents might navigate UIs and understand UI structure, the effects of agents and their autonomous actions, particularly those that may be risky or irreversible, remain under-explored. In this work, we investigate the real-world impacts and consequences of mobile UI actions taken by AI agents. We began by developing a taxonomy of the impacts of mobile UI actions through a series of workshops with domain experts. We then conducted a data synthesis study to gather realistic mobile UI screen traces and action data that users perceive as impactful. Using our impact categories, we annotated both the collected data and data repurposed from existing mobile UI navigation datasets. Our quantitative evaluations of different large language models (LLMs) and variants demonstrate how well these models understand the impacts of mobile UI actions that an agent might take. We show that our taxonomy enhances the reasoning capabilities of these LLMs for understanding the impacts of mobile UI actions, but our findings also reveal significant gaps in their ability to reliably classify more nuanced or complex categories of impact.