KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

πŸ“… 2026-04-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing benchmarks struggle to evaluate mobile agents’ ability to actively infer user preferences, determine appropriate intervention timing, and negotiate consent during real-world interactions. To address this gap, this work introduces an online evaluation platform built on a reproducible Android simulation environment, which for the first time integrates preference inference, proactive intervention calibration, and GUI execution into a unified assessment pipeline. The platform employs an LLM-driven structured user simulator to generate realistic dialogues and consent-negotiation behaviors, using behavioral logs instead of explicit user profiles to enable multi-turn preference elicitation and proactive service decisions. A hybrid evaluation protocol combining rule-based validation and LLM-as-a-Judge reveals that state-of-the-art models achieve less than 50% task success under ambiguous instructions, highlighting preference acquisition and intervention calibration as critical bottlenecks in current agent capabilities.
πŸ“ Abstract
Personalized mobile agents that infer user preferences and calibrate proactive assistance hold great promise as everyday digital assistants, yet existing benchmarks fail to capture what this requires. Prior work evaluates preference recovery from static histories or intent prediction from fixed contexts. Neither tests whether an agent can elicit missing preferences through interaction, nor whether it can decide when to intervene, seek consent, or remain silent in a live GUI environment. We introduce KnowU-Bench, an online benchmark for personalized mobile agents built on a reproducible Android emulation environment, covering 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Unlike prior work that treats user preferences as static context, KnowU-Bench hides the user profile from the agent and exposes only behavioral logs, forcing genuine preference inference rather than context lookup. To support multi-turn preference elicitation, it instantiates an LLM-driven user simulator grounded in structured profiles, enabling realistic clarification dialogues and proactive consent handling. Beyond personalization, KnowU-Bench provides comprehensive evaluation of the complete proactive decision chain, including grounded GUI execution, consent negotiation, and post-rejection restraint, evaluated through a hybrid protocol combining rule-based verification with LLM-as-a-Judge scoring. Our experiments reveal a striking degradation: agents that excel at explicit task execution fall below 50% under vague instructions requiring user preference inference or intervention calibration, even for frontier models like Claude Sonnet 4.6. The core bottlenecks are not GUI navigation but preference acquisition and intervention calibration, exposing a fundamental gap between competent interface operation and trustworthy personal assistance.
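The hybrid protocol described above combines deterministic rule-based verification of the final GUI state with LLM-as-a-Judge scoring of the interaction quality. A minimal sketch of how such a combined scorer might be wired together is shown below; all names (`TaskResult`, `rule_check`, `llm_judge`, `evaluate`) and the scoring weights are illustrative assumptions, not the paper's actual implementation, and the judge is stubbed with a trivial heuristic where a real system would prompt a judge model.

```python
from dataclasses import dataclass, field

@dataclass
class TaskResult:
    """Hypothetical container for one evaluated episode."""
    final_ui_state: dict          # key observations from the Android emulator
    dialogue: list = field(default_factory=list)  # agent/user-simulator turns

def rule_check(result: TaskResult, expected_state: dict) -> bool:
    """Deterministic verification: every required UI key must match."""
    return all(result.final_ui_state.get(k) == v
               for k, v in expected_state.items())

def llm_judge(dialogue: list) -> float:
    """Stub for an LLM-as-a-Judge call scoring consent negotiation and
    preference elicitation on a 0-1 scale. A real implementation would
    send the transcript to a judge model; here we only check whether
    the agent asked for consent at all (illustrative heuristic)."""
    return 1.0 if any("may i" in turn.lower() for turn in dialogue) else 0.0

def evaluate(result: TaskResult, expected_state: dict,
             judge_weight: float = 0.5) -> float:
    """Blend the rule-based outcome check with the judge score."""
    rule_score = 1.0 if rule_check(result, expected_state) else 0.0
    judge_score = llm_judge(result.dialogue)
    return (1 - judge_weight) * rule_score + judge_weight * judge_score
```

An episode that reaches the correct final state and negotiates consent would score 1.0 under this blend; a correct state reached without consent would score only 0.5, which mirrors the paper's emphasis on consent handling over raw GUI execution.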
Problem

Research questions and friction points this paper is trying to address.

personalized mobile agents
preference inference
proactive assistance
interactive evaluation
user consent
Innovation

Methods, ideas, or system contributions that make the work stand out.

personalized mobile agents
proactive assistance
preference inference
user simulation
interactive evaluation benchmark