GhostUI: Unveiling Hidden Interactions in Mobile UI

📅 2026-01-27
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses the challenge posed by hidden interactions in mobile applications—such as long presses and swipes—that lack visual cues, hindering both user discoverability and the ability of vision-language models (VLMs) to accurately execute tasks. To tackle this issue, we present GhostUI, the first large-scale dataset specifically designed for modeling implicit gestures, comprising before-and-after screenshots, simplified view hierarchies, gesture metadata, and task descriptions. Building upon this dataset, we propose a VLM fine-tuning approach that leverages fine-grained UI metadata and contrastive learning between pre- and post-interaction states. Experimental results demonstrate that our fine-tuned model significantly outperforms baseline methods in both identifying hidden interactions and predicting post-interaction UI states, underscoring GhostUI’s pivotal role in advancing mobile UI automation.
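The summary lists the components of each GhostUI sample: paired before-and-after screenshots, a simplified view hierarchy, gesture metadata, and a task description. As a rough illustration only, here is what one record might look like in Python; every class name, field name, and value below is a hypothetical assumption, since this page does not specify the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class GhostUISample:
    """Hypothetical record mirroring the components listed in the summary."""
    screenshot_before: str   # pre-interaction screenshot (image path)
    screenshot_after: str    # post-interaction screenshot (image path)
    view_hierarchy: dict     # simplified view hierarchy as a JSON-like tree
    gesture: dict            # gesture metadata: type, coordinates, direction, etc.
    task_description: str    # natural-language task the gesture accomplishes

# Example instance (all values invented for illustration):
sample = GhostUISample(
    screenshot_before="screens/0001_pre.png",
    screenshot_after="screens/0001_post.png",
    view_hierarchy={"class": "RecyclerView", "bounds": [0, 0, 1080, 2280], "children": []},
    gesture={"type": "swipe", "start": [980, 1800], "end": [100, 1800]},
    task_description="Archive the first email in the inbox by swiping it left.",
)
```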

📝 Abstract
Modern mobile applications rely on hidden interactions--gestures without visual cues, such as long presses and swipes--to provide functionality without cluttering interfaces. While experienced users may discover these interactions through prior use or onboarding tutorials, their implicit nature makes them difficult for most users to uncover. Similarly, mobile agents--systems powered by vision-language models (VLMs) that automate tasks on mobile user interfaces--struggle to detect these veiled interactions or to determine the actions needed to complete tasks. To address this challenge, we present GhostUI, a new dataset designed to enable the detection of hidden interactions in mobile applications. GhostUI provides before-and-after screenshots, simplified view hierarchies, gesture metadata, and task descriptions, allowing VLMs to better recognize concealed gestures and anticipate post-interaction states. Quantitative evaluations show that VLMs fine-tuned on GhostUI outperform baseline VLMs, particularly in predicting hidden interactions and inferring post-interaction screens, underscoring GhostUI's potential as a foundation for advancing mobile task automation.
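The AI summary above also mentions contrastive learning between pre- and post-interaction states. The paper's exact objective is not reproduced on this page; the sketch below shows one standard formulation such a setup could use, a symmetric InfoNCE loss that pulls each pre-interaction embedding toward the post-interaction embedding from the same sample. The function name, signature, and temperature value are all assumptions:

```python
import torch
import torch.nn.functional as F

def pre_post_contrastive_loss(pre_emb: torch.Tensor,
                              post_emb: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of matched (pre, post) screen embeddings.

    pre_emb, post_emb: (batch, dim) embeddings, e.g. from the VLM's vision encoder.
    For each row i, the positive pair is (pre_emb[i], post_emb[i]); every other
    row in the batch serves as an in-batch negative.
    """
    pre = F.normalize(pre_emb, dim=-1)
    post = F.normalize(post_emb, dim=-1)
    logits = pre @ post.T / temperature                 # (batch, batch) cosine similarities
    targets = torch.arange(pre.size(0), device=pre.device)
    # Average the two matching directions, as in CLIP-style training.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```

Treating mismatched in-batch pairs as negatives would push the encoder to distinguish post-interaction screens of visually similar states, which is plausibly what helps with the post-interaction prediction gains the abstract reports.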
Keywords

hidden interactions
mobile UI
gesture detection
vision-language models
task automation