Do Phone-Use Agents Respect Your Privacy?

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of effective methods for evaluating the privacy-compliant behavior of mobile intelligent agents. The authors propose MyPhoneBench, a framework that operationalizes privacy compliance into quantifiable metrics grounded in a minimal privacy contract, iMy, covering permission minimization, data minimization, and user-controllable memory. By integrating instrumented app simulation, rule-driven auditing, and multidimensional behavior tracking, MyPhoneBench establishes a reproducible and observable evaluation system for mobile-agent privacy. Experiments across five state-of-the-art models, ten real-world applications, and 300 tasks reveal that all models collect non-essential information excessively. Notably, task success rates show no positive correlation with privacy compliance, indicating that relying solely on success metrics significantly overestimates the actual privacy safety of deployed systems.
📝 Abstract
We study whether phone-use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy-compliant behavior is not operationalized for phone-use agents, and ordinary apps do not reveal exactly what data agents type into which form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. We operationalize privacy-respecting phone use as permissioned access, minimal disclosure, and user-controlled memory through a minimal privacy contract, iMy, and pair it with instrumented mock apps plus rule-based auditing that make unnecessary permission requests, deceptive re-disclosure, and unnecessary form filling observable and reproducible. Across five frontier models on 10 mobile apps and 300 tasks, we find that task success, privacy-compliant task completion, and later-session use of saved preferences are distinct capabilities, and no single model dominates all three. Evaluating success and privacy jointly reshuffles the model ordering relative to either metric alone. The most persistent failure mode across models is simple data minimization: agents still fill optional personal entries that the task does not require. These results show that privacy failures arise from over-helpful execution of benign tasks, and that success-only evaluation overestimates the deployment readiness of current phone-use agents. All code, mock apps, and agent trajectories are publicly available at https://github.com/tangzhy/MyPhoneBench.
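The data-minimization failure described above can be illustrated with a toy rule-based audit: given the fields a task genuinely requires and the fields an agent actually filled, every extra personal entry counts as a violation. The field names and output structure below are illustrative assumptions, not the actual MyPhoneBench implementation.

```python
# Toy rule-based audit for data minimization (illustrative sketch only;
# not the real MyPhoneBench code). A trajectory logs which form fields the
# agent filled; the task spec lists which fields are actually required.

REQUIRED_FIELDS = {"delivery_address"}               # needed to finish the task
OPTIONAL_PERSONAL = {"birthday", "phone", "gender"}  # app offers, task ignores

def audit_minimization(filled_fields: set[str]) -> dict:
    """Flag any optional personal entry the agent filled unnecessarily."""
    violations = filled_fields & OPTIONAL_PERSONAL
    return {
        "task_complete": REQUIRED_FIELDS <= filled_fields,  # subset check
        "violations": sorted(violations),
        "compliant": not violations,
    }

# An over-helpful agent fills every visible field, so the task succeeds
# while the privacy audit fails:
result = audit_minimization({"delivery_address", "birthday", "phone"})
```

The point mirrors the paper's finding: `task_complete` and `compliant` are independent verdicts, so scoring success alone would hide the two unnecessary disclosures.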
Problem

Research questions and friction points this paper is trying to address.

privacy
mobile agents
data minimization
permissioned access
user-controlled memory
Innovation

Methods, ideas, or system contributions that make the work stand out.

privacy-compliant agents
mobile agent evaluation
data minimization
verifiable benchmarking
minimal privacy contract
Authors

Zhengyang Tang
CUHKSZ
Large Language Models, Mathematical Reasoning, Information Retrieval

Ke Ji
PhD student, The Chinese University of Hong Kong, Shenzhen
Large Language Models, Agent, Mathematical Reasoning

Xidong Wang
The Chinese University of Hong Kong, Shenzhen

Zihan Ye
University of Chinese Academy of Sciences (UCAS)
Deep Learning, Zero-shot Learning, Computer Vision, Generative Model

Xinyuan Wang
PhD student, The University of Hong Kong
AI, Agent, NLP

Yiduo Guo
Hunyuan Team, Tencent

Ziniu Li
The Chinese University of Hong Kong, Shenzhen
Machine Learning, Reinforcement Learning, Large Language Models

Chenxin Li
The Chinese University of Hong Kong
Multimodal LLM, Agent, World Model

Jingyuan Hu
The Chinese University of Hong Kong, Shenzhen

Shunian Chen
The Chinese University of Hong Kong, Shenzhen
Large Language Models, Multimodal Large Language Models, Agent

Tongxu Luo
The Chinese University of Hong Kong, Shenzhen

Jiaxi Bi
The Chinese University of Hong Kong, Shenzhen

Zeyu Qin
Hong Kong University of Science and Technology
Machine Learning, Deep Learning, Scalable Oversight, AI Safety

Shaobo Wang
Shanghai Jiao Tong University
Large Language Models, Data-Centric AI, Data Synthesis, Data Selection, Explainable AI

Xin Lai
ByteDance
Multimodal Understanding, Multimodal Agent

Pengyuan Lyu
Huazhong University of Science and Technology
Computer Vision

Junyi Li
The University of Hong Kong
Computer Vision, Multimodal Understanding, Multimodal Agent

Can Xu
Tencent Hunyuan X
Natural Language Processing

Chengquan Zhang
Unknown affiliation
Computer Vision, Applications of Deep Learning

Han Hu
Distinguished Scientist, Tencent Hunyuan
Computer Vision, Deep Learning, Machine Learning

Ming Yan
The Chinese University of Hong Kong, Shenzhen

Benyou Wang
Assistant Professor, The Chinese University of Hong Kong, Shenzhen
Large Language Models, Natural Language Processing, Information Retrieval, Applied Machine Learning