MobileViews: A Large-Scale Mobile GUI Dataset

📅 2024-09-22
🏛️ arXiv.org
📈 Citations: 13
Influential: 0
🤖 AI Summary
Existing mobile GUI datasets (e.g., Rico) suffer from severe limitations in scale, application coverage, and annotation fidelity, hindering the training of on-device GUI agents. To address this, we introduce MobileViews—the largest high-fidelity mobile GUI dataset to date—comprising 603,000 screenshot–view hierarchy pairs spanning 21,400 modern Android applications. We propose a novel LLM-enhanced automated UI traversal framework, leveraging 200+ parallel virtual Android instances (accumulating 81,600 device-hours) to enable synchronized, high-fidelity capture and annotation of screenshots and hierarchical UI structures. MobileViews substantially advances multimodal large language models' performance on screen understanding tasks, consistently outperforming Rico across all benchmarks. Our results empirically validate that large-scale, high-fidelity, fine-grained GUI data is critical for developing capable on-device intelligent assistants.
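The core unit of the dataset is a synchronized screenshot–view hierarchy pair. The paper's collection system is not public, but the capture step can be sketched with standard Android tooling (`adb exec-out screencap -p` for the screenshot, `uiautomator dump` for the XML hierarchy); function names and file layout here are illustrative assumptions, not the authors' implementation.

```python
import subprocess
from pathlib import Path


def pair_paths(out_dir: Path, screen_id: int) -> tuple[Path, Path]:
    """Deterministic shared stem so screenshot and hierarchy stay paired."""
    return out_dir / f"{screen_id:06d}.png", out_dir / f"{screen_id:06d}.xml"


def capture_pair(serial: str, out_dir: Path, screen_id: int) -> tuple[Path, Path]:
    """Capture one screenshot/view-hierarchy pair from one Android instance.

    Illustrative sketch only; the paper's own pipeline is not described
    at this level of detail.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    png, xml = pair_paths(out_dir, screen_id)

    # Screenshot: exec-out streams the raw PNG bytes without shell mangling.
    png.write_bytes(subprocess.run(
        ["adb", "-s", serial, "exec-out", "screencap", "-p"],
        check=True, capture_output=True).stdout)

    # View hierarchy: dump the XML on-device, then pull it to the host.
    subprocess.run(["adb", "-s", serial, "shell", "uiautomator", "dump",
                    "/sdcard/vh.xml"], check=True, capture_output=True)
    subprocess.run(["adb", "-s", serial, "pull", "/sdcard/vh.xml", str(xml)],
                   check=True, capture_output=True)
    return png, xml
```

Keeping both artifacts under one numeric stem makes it trivial to join them later when building screenshot–hierarchy training examples.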

📝 Abstract
Mobile screen assistants help smartphone users by interpreting mobile screens and responding to user requests. The excessive private information on mobile screens necessitates small, on-device models to power these assistants. However, there is a lack of a comprehensive and large-scale mobile screen dataset with high diversity to train and enhance these models. To efficiently construct such a dataset, we utilize an LLM-enhanced automatic app traversal tool to minimize human intervention. We then employ two SoC clusters to provide high-fidelity mobile environments, including more than 200 Android instances to parallelize app interactions. By utilizing the system to collect mobile screens over 81,600 device-hours, we introduce MobileViews, the largest mobile screen dataset, which includes over 600K screenshot-view hierarchy pairs from more than 20K modern Android apps. We demonstrate the effectiveness of MobileViews by training SOTA multimodal LLMs that power mobile screen assistants on it and the Rico dataset, which was introduced seven years ago. Evaluation results on mobile screen tasks show that the scale and quality of mobile screens in MobileViews demonstrate significant advantages over Rico in augmenting mobile screen assistants.
Problem

Research questions and friction points this paper is trying to address.

Addresses limited scale and quality of mobile GUI datasets
Provides automated framework for collecting diverse mobile interface data
Enhances visual language model performance for GUI grounding tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizing mobile SoC clusters for high-fidelity environments
Implementing an LLM-enhanced automatic application traversal framework
Collecting over 600K screenshot–view hierarchy pairs with minimal human intervention
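The parallelism behind these contributions — driving 200+ Android instances at once — can be sketched as a simple scheduler. The paper does not specify its assignment policy, so the round-robin split and the `traverse_one(serial, app)` callback below are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def assign_apps(serials: list[str], apps: list[str]) -> dict[str, list[str]]:
    """Round-robin app-to-instance assignment (assumed policy)."""
    return {s: apps[i::len(serials)] for i, s in enumerate(serials)}


def traverse_all(serials: list[str], apps: list[str],
                 traverse_one: Callable[[str, str], None],
                 max_workers: int = 16) -> None:
    """Run per-instance app traversal in parallel threads.

    `traverse_one` is a hypothetical callback that would drive the
    LLM-enhanced traversal of one app on one Android instance.
    """
    plan = assign_apps(serials, apps)

    def run_instance(serial: str) -> None:
        # Each instance works through its own app queue sequentially.
        for app in plan[serial]:
            traverse_one(serial, app)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Consume the iterator so worker exceptions propagate.
        list(pool.map(run_instance, serials))
```

Threads suffice here because the real work (adb I/O, emulator interaction) is I/O-bound; each instance stays sequential internally so its UI state remains consistent.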