🤖 AI Summary
To address the scarcity of large-scale, high-quality, low-cost training data for cross-platform navigation tasks on mobile operating systems, this paper introduces the first fully automated framework for building video-to-task datasets. Starting from 20K publicly available instructional videos, the method combines OCR-based scene detection, high-precision UI element detection, multi-step action identification, and frame-level semantic alignment to automatically extract 313K interaction frames with structured annotations. The result is the MONDAY dataset, which covers diverse, real-world GUI scenarios across multiple platforms. Key innovations include a robust multi-step action recognition mechanism that works across diverse interface configurations and a scalable end-to-end pipeline. A vision agent pre-trained on MONDAY achieves an average accuracy improvement of 18.11 percentage points on an unseen mobile OS platform, demonstrating substantial gains in cross-platform generalization.
📝 Abstract
Recent advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have sparked significant interest in developing GUI visual agents. We introduce MONDAY (Mobile OS Navigation Task Dataset for Agents from YouTube), a large-scale dataset of 313K annotated frames from 20K instructional videos capturing diverse real-world mobile OS navigation across multiple platforms. Models that include MONDAY in their pre-training phases demonstrate robust cross-platform generalization capabilities, consistently outperforming models trained on existing single-OS datasets while achieving an average performance gain of 18.11%p on an unseen mobile OS platform. To enable continuous dataset expansion as mobile platforms evolve, we present an automated framework that leverages publicly available video content to create comprehensive task datasets without manual annotation. Our framework comprises robust OCR-based scene detection (95.04% F1 score), near-perfect UI element detection (99.87% hit ratio), and novel multi-step action identification to extract reliable action sequences across diverse interface configurations. We contribute both the MONDAY dataset and our automated collection framework to facilitate future research in mobile OS navigation.
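To make the OCR-based scene detection idea concrete, here is a minimal illustrative sketch (not the paper's actual implementation): given the OCR text already extracted from each video frame, a scene boundary is flagged wherever the on-screen text of consecutive frames diverges sharply. The function name, the similarity measure (`difflib.SequenceMatcher`), and the threshold value are all assumptions for illustration.

```python
from difflib import SequenceMatcher

def detect_scene_boundaries(frame_texts, threshold=0.5):
    """Illustrative sketch: flag a scene change between consecutive frames
    when their OCR text similarity drops below `threshold`.

    frame_texts: list of OCR strings, one per sampled video frame.
    Returns the indices of frames that start a new scene.
    """
    boundaries = []
    for i in range(1, len(frame_texts)):
        sim = SequenceMatcher(None, frame_texts[i - 1], frame_texts[i]).ratio()
        if sim < threshold:
            boundaries.append(i)  # screen text changed sharply -> new scene
    return boundaries

# Hypothetical OCR output from three frames of a settings walkthrough:
texts = [
    "Settings  Wi-Fi  Bluetooth",
    "Settings  Wi-Fi  Bluetooth",          # same screen, OCR nearly identical
    "Wi-Fi  Choose a network  HomeNet",    # new screen after a tap
]
print(detect_scene_boundaries(texts))      # boundary at the screen transition
```

A real pipeline would be more robust (e.g., tolerant to OCR noise and partial occlusions), but the core signal is the same: large frame-to-frame text change marks a navigation step.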