Do Phone-Use Agents Respect Your Privacy?

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of effective methods for evaluating the privacy-compliant behavior of mobile intelligent agents. The authors propose MyPhoneBench, a framework that operationalizes privacy compliance into quantifiable metrics grounded in a minimal privacy contract, iMy, covering permission minimization, data minimization, and user-controllable memory. By integrating instrumented app simulation, rule-driven auditing, and multidimensional behavior tracking, MyPhoneBench establishes a reproducible and observable evaluation system for mobile-agent privacy. Experiments across five state-of-the-art models, ten real-world applications, and 300 tasks reveal that all models collect non-essential information excessively. Notably, task success rates show no positive correlation with privacy compliance, indicating that relying solely on success metrics significantly overestimates the actual privacy safety of deployed systems.
📝 Abstract
We study whether phone-use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy-compliant behavior is not operationalized for phone-use agents, and ordinary apps do not reveal exactly what data agents type into which form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. We operationalize privacy-respecting phone use as permissioned access, minimal disclosure, and user-controlled memory through a minimal privacy contract, iMy, and pair it with instrumented mock apps plus rule-based auditing that make unnecessary permission requests, deceptive re-disclosure, and unnecessary form filling observable and reproducible. Across five frontier models on 10 mobile apps and 300 tasks, we find that task success, privacy-compliant task completion, and later-session use of saved preferences are distinct capabilities, and no single model dominates all three. Evaluating success and privacy jointly reshuffles the model ordering relative to either metric alone. The most persistent failure mode across models is simple data minimization: agents still fill optional personal entries that the task does not require. These results show that privacy failures arise from over-helpful execution of benign tasks, and that success-only evaluation overestimates the deployment readiness of current phone-use agents. All code, mock apps, and agent trajectories are publicly available at https://github.com/tangzhy/MyPhoneBench.
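The data-minimization failure described above can be illustrated with a toy rule-based audit: given the fields a task genuinely requires and the fields an agent actually filled, every extra personal entry counts as a violation. The field names and output structure below are illustrative assumptions, not the actual MyPhoneBench implementation.

```python
# Toy rule-based audit for data minimization (illustrative sketch only;
# not the real MyPhoneBench code). A trajectory logs which form fields the
# agent filled; the task spec lists which fields are actually required.

REQUIRED_FIELDS = {"delivery_address"}               # needed to finish the task
OPTIONAL_PERSONAL = {"birthday", "phone", "gender"}  # app offers, task ignores

def audit_minimization(filled_fields: set[str]) -> dict:
    """Flag any optional personal entry the agent filled unnecessarily."""
    violations = filled_fields & OPTIONAL_PERSONAL
    return {
        "task_complete": REQUIRED_FIELDS <= filled_fields,  # subset check
        "violations": sorted(violations),
        "compliant": not violations,
    }

# An over-helpful agent fills every visible field, so the task succeeds
# while the privacy audit fails:
result = audit_minimization({"delivery_address", "birthday", "phone"})
```

The point mirrors the paper's finding: `task_complete` and `compliant` are independent verdicts, so scoring success alone would hide the two unnecessary disclosures.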
Problem

Research questions and friction points this paper is trying to address.

privacy
mobile agents
data minimization
permissioned access
user-controlled memory
Innovation

Methods, ideas, or system contributions that make the work stand out.

privacy-compliant agents
mobile agent evaluation
data minimization
verifiable benchmarking
minimal privacy contract
Authors

Zhengyang Tang
CUHKSZ
Large Language Models, Mathematical Reasoning, Information Retrieval

Ke Ji
PhD student, The Chinese University of Hong Kong, Shenzhen
Large Language Models, Agent, Mathematical Reasoning

Xidong Wang
The Chinese University of Hong Kong, Shenzhen

Zihan Ye
University of Chinese Academy of Sciences (UCAS)
Deep Learning, Zero-shot Learning, Computer Vision, Generative Model

Xinyuan Wang
PhD student, The University of Hong Kong
AI, Agent, NLP

Yiduo Guo
Hunyuan Team, Tencent

Ziniu Li
The Chinese University of Hong Kong, Shenzhen
Machine Learning, Reinforcement Learning, Large Language Models

Chenxin Li
The Chinese University of Hong Kong
Multimodal LLM, Agent, World Model

Jingyuan Hu
The Chinese University of Hong Kong, Shenzhen

Shunian Chen
The Chinese University of Hong Kong, Shenzhen
Large Language Models, Multimodal Large Language Models, Agent

Tongxu Luo
The Chinese University of Hong Kong, Shenzhen

Jiaxi Bi
The Chinese University of Hong Kong, Shenzhen

Zeyu Qin
Hong Kong University of Science and Technology
Machine Learning, Deep Learning, Scalable Oversight, AI Safety

Shaobo Wang
Shanghai Jiao Tong University
Large Language Models, Data-Centric AI, Data Synthesis, Data Selection, Explainable AI

Xin Lai
ByteDance
Multimodal Understanding, Multimodal Agent

Pengyuan Lyu
Huazhong University of Science and Technology
Computer Vision

Junyi Li
The University of Hong Kong
Computer Vision, Multimodal Understanding, Multimodal Agent

Can Xu
Tencent Hunyuan X
Natural Language Processing

Chengquan Zhang
Unknown affiliation
Computer Vision, Applications of Deep Learning

Han Hu
Distinguished Scientist, Tencent Hunyuan
Computer Vision, Deep Learning, Machine Learning

Ming Yan
The Chinese University of Hong Kong, Shenzhen

Benyou Wang
Assistant Professor, The Chinese University of Hong Kong, Shenzhen
Large Language Models, Natural Language Processing, Information Retrieval, Applied Machine Learning