ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

📅 2025-09-18
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Current computer use agents (CUAs) are limited in scalability and cross-platform generalization because large-scale, open-source GUI automation data and foundation models are scarce. To address this, we introduce ScaleCUA, a step toward scaling open-source CUAs: (1) we construct a large-scale dataset spanning six operating systems and three task domains, built with a closed-loop pipeline that pairs automated agents with human experts; (2) we train vision-language models (VLMs) on this data using a unified action space and cross-platform instruction tuning. The resulting models improve over baselines by +26.6 on WebArena-Lite-v2 and +10.7 on ScreenSpot-Pro, and set new state-of-the-art results of 94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, and 47.4% on WebArena-Lite-v2. These results indicate that data-driven scaling lets a single agent operate across platforms.
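
To make the idea of a unified action space concrete, here is a minimal sketch. It is not ScaleCUA's actual action schema: the action types (`Click`, `TypeText`), the platform keys (`"desktop"`, `"mobile"`), and the emitted command strings are all hypothetical names chosen for illustration. The only point it shows is that a model can emit one platform-independent action, with normalized coordinates, and a thin per-platform layer turns it into a concrete command.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass(frozen=True)
class Click:
    """Platform-independent click; x and y are normalized to [0, 1]."""
    x: float
    y: float


@dataclass(frozen=True)
class TypeText:
    """Platform-independent text entry."""
    text: str


Action = Union[Click, TypeText]


def to_desktop(action: Action, width: int, height: int) -> str:
    # Hypothetical desktop command strings, for illustration only.
    if isinstance(action, Click):
        return f"mouse_click({round(action.x * width)}, {round(action.y * height)})"
    if isinstance(action, TypeText):
        return f"keyboard_type({action.text!r})"
    raise TypeError(f"unsupported action: {action!r}")


def to_mobile(action: Action, width: int, height: int) -> str:
    # Hypothetical mobile command strings, for illustration only.
    if isinstance(action, Click):
        return f"tap({round(action.x * width)}, {round(action.y * height)})"
    if isinstance(action, TypeText):
        return f"input_text({action.text!r})"
    raise TypeError(f"unsupported action: {action!r}")


# One shared action vocabulary, one renderer per platform family.
BACKENDS: Dict[str, Callable[[Action, int, int], str]] = {
    "desktop": to_desktop,
    "mobile": to_mobile,
}


def render(action: Action, platform: str, width: int, height: int) -> str:
    """Translate a unified action into a command for the given platform."""
    return BACKENDS[platform](action, width, height)


if __name__ == "__main__":
    click = Click(x=0.5, y=0.25)
    print(render(click, "desktop", 1920, 1080))
    print(render(click, "mobile", 1080, 2400))
```

Under this design, training data from different platforms can share the same action vocabulary, and only the rendering layer needs to know about screen size or input method.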

📝 Abstract
Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.
Problem

Research questions and friction points this paper is trying to address.

Scarcity of large-scale, open-source computer use data and foundation models
Building cross-platform computer use agents on top of vision-language models
Scaling automated GUI operation across operating systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

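Closed-loop data collection pipeline that pairs automated agents with human experts across 6 operating systems and 3 task domains
Vision-language model training on the scaled-up dataset
A single agent that operates across platforms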