🤖 AI Summary
This study addresses the lack of interaction mechanisms in existing mobile GUI agents that support both transparency and multitasking. The authors propose AgentLens, a mobile GUI agent built on a hybrid visual interaction model. Its adaptive visual modality switching mechanism, the first of its kind for mobile GUI agents, dynamically selects among Full UI, Partial UI, and GenUI visualization modalities based on task characteristics and user preferences. Using Virtual Display technology, the system executes tasks in the background and shows visual overlays selectively. In a controlled study with 21 participants, 85.7% preferred AgentLens, which also achieved the best usability rating (overall PSSUQ = 1.94, where lower scores indicate better usability) and the strongest intention to adopt (6.43 out of 7).
📝 Abstract
Mobile GUI agents can automate smartphone tasks by interacting directly with app interfaces, but how they should communicate with users during execution remains underexplored. Existing systems rely on two extremes: foreground execution, which maximizes transparency but prevents multitasking, and background execution, which supports multitasking but provides little visual awareness. Through iterative formative studies, we found that users prefer a hybrid model with just-in-time visual interaction, but the most effective visualization modality depends on the task. Motivated by this, we present AgentLens, a mobile GUI agent that adaptively uses three visual modalities during human-agent interaction: Full UI, Partial UI, and GenUI. AgentLens extends a standard mobile agent with adaptive communication actions and uses Virtual Display to enable background execution with selective visual overlays. In a controlled study with 21 participants, AgentLens was preferred by 85.7% of participants and achieved the best usability rating (overall PSSUQ of 1.94, where lower is better) and the highest adoption intent (6.43/7).