Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

📅 2026-05-14

🤖 AI Summary
This work addresses the limited generalization of GUI agents, which stems from the scarcity of large-scale, diverse real-world interaction data; existing datasets rely on costly manual annotation and offer narrow coverage. To overcome this, the authors propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories from unlabeled Internet videos. Using a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories, they build WildGUI, a dataset of 12 million trajectories spanning over 1,500 applications and websites, from 500 million video metadata entries. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI improves performance by 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art results.
📝 Abstract
Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research on GUI agents.
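The abstract does not spell out how the coarse-to-fine filtering works, so the following is only an illustrative sketch of the general pattern, not the authors' pipeline: a cheap metadata pass discards most candidates, and a more expensive scorer runs only on the survivors. The `VideoMeta` type, the keyword list, the `score_fn` callable, and the threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class VideoMeta:
    """Hypothetical metadata record; the paper's actual schema is not given."""
    video_id: str
    title: str


# Assumed keywords for the coarse pass; the paper does not list its criteria.
GUI_KEYWORDS = ("tutorial", "how to", "walkthrough")


def coarse_filter(videos: Iterable[VideoMeta]) -> List[VideoMeta]:
    """Cheap metadata-only pass: keep videos whose title contains a keyword."""
    return [
        v for v in videos
        if any(k in v.title.lower() for k in GUI_KEYWORDS)
    ]


def fine_filter(
    videos: Iterable[VideoMeta],
    score_fn: Callable[[VideoMeta], float],
    threshold: float = 0.5,
) -> List[VideoMeta]:
    """Expensive pass: keep videos whose quality score meets the threshold.

    score_fn stands in for any costly check (for example, a model that
    inspects sampled frames); it is an assumption, not the paper's method.
    """
    return [v for v in videos if score_fn(v) >= threshold]


def coarse_to_fine(
    videos: Iterable[VideoMeta],
    score_fn: Callable[[VideoMeta], float],
    threshold: float = 0.5,
) -> List[VideoMeta]:
    """Run the cheap pass first so the costly scorer sees fewer candidates."""
    return fine_filter(coarse_filter(videos), score_fn, threshold)
```

The point of the ordering is cost: at the scale of hundreds of millions of metadata entries, the expensive scorer is only affordable if the coarse pass has already removed the bulk of irrelevant videos.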
Problem

Research questions and friction points this paper is trying to address.

GUI agents
generalization
large-scale training data
interaction trajectories
data scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video2GUI
GUI agent
interaction trajectory
automated data extraction
large-scale pretraining