GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

📅 2024-06-12

🏛️ arXiv.org

📈 Citations: 47

✨ Influential: 7

🤖 AI Summary

Existing GUI navigation agents are typically trained on single-application datasets and generalize poorly across applications. This work addresses that limitation by introducing GUI Odyssey, the first large-scale cross-application GUI navigation dataset, comprising 7,735 navigation trajectories collected on 6 mobile devices and spanning 201 applications and roughly 1,400 application combinations. The authors formally define and release the first systematic cross-app navigation benchmark. To capture long-range operational dependencies, they propose a history resampling module. They also develop OdysseyAgent, a multimodal agent built on Qwen-VL that jointly models interface screenshots and sequential action histories. Experiments show that OdysseyAgent achieves absolute accuracy improvements of +1.44% (in-domain) and +2.29% (out-of-domain) over fine-tuned Qwen-VL, and +55.49% (in-domain) and +48.14% (out-of-domain) over zero-shot GPT-4V. These results support further research on automated, multi-step, cross-application mobile task execution.

📝 Abstract

Smartphone users often navigate across multiple applications (apps) to complete tasks such as sharing content between social media platforms. Autonomous Graphical User Interface (GUI) navigation agents can enhance user experience in communication, entertainment, and productivity by streamlining workflows and reducing manual intervention. However, prior GUI agents were often trained on datasets comprising simple tasks that can be completed within a single app, leading to poor performance in cross-app navigation. To address this problem, we introduce GUI Odyssey, a comprehensive dataset for training and evaluating cross-app navigation agents. GUI Odyssey consists of 7,735 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 201 apps, and 1.4K app combos. Leveraging GUI Odyssey, we developed OdysseyAgent, a multimodal cross-app navigation agent, by fine-tuning the Qwen-VL model with a history resampling module. Extensive experiments demonstrate OdysseyAgent's superior accuracy compared to existing models. For instance, OdysseyAgent surpasses fine-tuned Qwen-VL and zero-shot GPT-4V by 1.44% and 55.49% in in-domain accuracy, and by 2.29% and 48.14% in out-of-domain accuracy, on average. The dataset and code will be released at https://github.com/OpenGVLab/GUI-Odyssey.
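
To make the idea of an episode and its action history concrete, here is a minimal Python sketch. It is not the GUI Odyssey data format or the OdysseyAgent implementation: the `Step` and `Episode` classes, their field names, the action strings, and the `build_input` helper are all hypothetical, chosen only to illustrate how a cross-app trajectory might pair each screenshot with the actions that preceded it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    # All fields are hypothetical; the released dataset may use a different schema.
    screenshot: str  # path to the screenshot observed at this step
    action: str      # textual action label, e.g. "CLICK(540, 1200)"
    app: str         # app in the foreground at this step


@dataclass
class Episode:
    instruction: str
    steps: List[Step] = field(default_factory=list)

    def apps_used(self) -> List[str]:
        """Return the distinct apps in order of first appearance."""
        seen: List[str] = []
        for step in self.steps:
            if step.app not in seen:
                seen.append(step.app)
        return seen


def build_input(episode: Episode, t: int, history_len: int = 4) -> dict:
    """Assemble a training example for step t.

    The example pairs the current screenshot with up to `history_len`
    preceding actions; the action taken at step t is the prediction target.
    """
    start = max(0, t - history_len)
    return {
        "instruction": episode.instruction,
        "screenshot": episode.steps[t].screenshot,
        "history": [s.action for s in episode.steps[start:t]],
        "target": episode.steps[t].action,
    }
```

An episode counts as cross-app in this sketch when `apps_used()` returns more than one app. The fixed-length window in `build_input` is a simple stand-in for history handling and should not be read as a description of the paper's history resampling module.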

Problem

Research questions and friction points this paper is trying to address.

Addresses poor performance in cross-app GUI navigation

Introduces the GUI Odyssey dataset for multi-app mobile tasks

Enhances agent reasoning for complex cross-app workflows

Innovation

Methods, ideas, or system contributions that make the work stand out.

GUI Odyssey dataset for cross-app navigation

Semantic reasoning annotations enhance model cognition

OdysseyAgent with history resampler improves navigation

🔎 Similar Papers

MobileViews: A Large-Scale Mobile GUI Dataset