VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high cost and scalability bottleneck of GUI-agent training, in particular its reliance on labor-intensive, manually annotated behavioral trajectories, this paper proposes VideoAgentTrek, a framework that automatically reconstructs high-fidelity GUI action sequences from unlabeled screen-capture videos. Its core is Video2Action, an inverse dynamics module that pairs a video grounding model, which precisely localizes temporal action boundaries, with an action-content recognition model that extracts structured parameters (e.g., click coordinates, typed text). The agent is then optimized via continued pretraining followed by supervised fine-tuning. Applied to 39,000 YouTube tutorial videos, the pipeline auto-generates 1.52 million interaction steps. The resulting model improves task success rate on OSWorld-Verified from 9.3% to 15.8% (a 70% relative improvement) and reaches 69.3% step accuracy on AgentNetBench. This work establishes a human-annotation-free, scalable pretraining paradigm for GUI agents.

📝 Abstract
Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos at web scale, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries and context, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text with high fidelity. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps automatically. We leverage this data through continued pretraining followed by supervised fine-tuning. On OSWorld-Verified, our approach improves task success rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results demonstrate that passive internet videos can be transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation.
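The abstract's two-stage inverse dynamics module (temporal grounding of actions, then parameter extraction) can be sketched as a minimal pipeline. This is an illustrative stand-in, not the paper's implementation: the record schema, function names, and the pre-tagged segment inputs standing in for the two learned models are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical structured action record; field names are illustrative,
# not the paper's actual schema.
@dataclass
class GUIAction:
    action_type: str   # e.g. "click", "type"
    start_s: float     # temporal action boundary (seconds)
    end_s: float
    params: dict       # e.g. {"x": 412, "y": 88} or {"text": "hello"}

def ground_actions(segments):
    """Stage 1 stand-in: a real video grounding model would detect action
    boundaries from pixels; here the segments are pre-tagged for illustration."""
    return [(s["t0"], s["t1"], s["kind"]) for s in segments if s.get("kind")]

def recognize_content(kind):
    """Stage 2 stand-in: a real action-content recognizer would read the
    cropped clip; we return fixed placeholder parameters."""
    if kind == "click":
        return {"x": 412, "y": 88}
    if kind == "type":
        return {"text": "hello"}
    return {}

def video2action(segments):
    """Run the two-stage IDM: grounding, then content recognition."""
    return [
        GUIAction(kind, t0, t1, recognize_content(kind))
        for t0, t1, kind in ground_actions(segments)
    ]

demo = [
    {"t0": 1.0, "t1": 1.4, "kind": "click"},
    {"t0": 2.0, "t1": 3.1, "kind": "type"},
    {"t0": 4.0, "t1": 4.2},  # no GUI action detected in this span
]
steps = video2action(demo)
```

The key design point the sketch mirrors is the separation of concerns: grounding decides *when* an action happens, recognition decides *what* its parameters are, and only their combination yields a trainable interaction step.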
Problem

Research questions and friction points this paper is trying to address.

Automating GUI action annotation from unlabeled screen-recorded videos
Extracting precise interaction parameters like clicks and typed text
Generating scalable training data for computer-use agents without manual labeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically mines training data from screen-recorded videos
Develops inverse dynamics module to extract GUI actions
Trains the agent on 1.52 million auto-generated interaction steps via continued pretraining followed by supervised fine-tuning