OpenCUA: Open Foundations for Computer-Use Agents

📅 2025-08-12
🤖 AI Summary
Current computer-use agents (CUAs) rely on proprietary, closed systems, which hinders open research and reproducibility. To address this, we introduce OpenCUA, an open-source framework for scaling CUA data and foundation models. It comprises three core components: (1) an annotation infrastructure that captures human computer-use demonstrations; (2) AgentNet, a large-scale task dataset covering Windows, macOS, and Linux, with demonstrations from 200+ applications and websites; and (3) a scalable pipeline that transforms demonstrations into state-action pairs with reflective long Chain-of-Thought reasoning. Evaluated on OSWorld-Verified, OpenCUA-32B achieves an average success rate of 34.8%, a new state of the art among open-source models that also surpasses OpenAI CUA (GPT-4o). Further analysis indicates that the approach generalizes well across domains and benefits from increased test-time computation.

📝 Abstract
Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; and (3) a scalable pipeline that transforms demonstrations into state-action pairs with reflective long Chain-of-Thought reasoning that sustains robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-32B achieves an average success rate of 34.8% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.
Problem

Research questions and friction points this paper is trying to address.

The most capable CUA systems are closed and commercial, so critical details of how they work are not public.
Researchers need open CUA frameworks to study the capabilities, limitations, and risks of these agents.
OpenCUA responds by releasing its annotation tool, datasets, code, and models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source framework for vision-language CUA models
Large-scale dataset spanning multiple OS and applications
Scalable pipeline with reflective Chain-of-Thought reasoning
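
To make the "demonstrations into state-action pairs" idea concrete, here is a minimal sketch in Python. It is not the OpenCUA pipeline or the AgentNet schema: the `Step` and `StateActionPair` types, their field names, and the action strings are all hypothetical, chosen only for illustration. It shows the basic transformation, in which each recorded step becomes one training example made of the task instruction, the actions taken so far, the current screenshot, and the action to predict. The reflective long Chain-of-Thought annotation described in the abstract is not modeled here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    """One recorded step of a human demonstration (hypothetical schema)."""
    screenshot: str  # path or ID of the screenshot captured before the action
    action: str      # e.g. "click(120, 300)"; the action format is an assumption


@dataclass
class StateActionPair:
    """One training example: what the agent observes and what it should do."""
    instruction: str
    history: Tuple[str, ...]  # actions taken earlier in the same demonstration
    observation: str          # screenshot for the current step
    action: str               # target action for this state


def to_state_action_pairs(instruction: str, steps: List[Step]) -> List[StateActionPair]:
    """Expand one demonstration into one state-action pair per recorded step."""
    pairs = []
    for i, step in enumerate(steps):
        # The state at step i includes every action that preceded it.
        history = tuple(s.action for s in steps[:i])
        pairs.append(StateActionPair(instruction, history, step.screenshot, step.action))
    return pairs


# Toy demonstration with made-up screenshots and actions.
demo = [
    Step("shot_0.png", "click(120, 300)"),
    Step("shot_1.png", "type('report.txt')"),
    Step("shot_2.png", "press('enter')"),
]
pairs = to_state_action_pairs("Rename the file to report.txt", demo)
print(len(pairs))
print(pairs[2].history)
```

A demonstration with N steps yields N examples, and the history grows by one action per step. A real pipeline would likely also filter noisy steps and attach reasoning text to each pair, which this sketch leaves out.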
Authors
Xinyuan Wang
XLANG Lab, University of Hong Kong
Bowen Wang
XLANG Lab, University of Hong Kong
Dunjie Lu
Bachelor of Computer Science, Sun Yat-sen University
AI, ML, LLM, VLM, Agent
Junlin Yang
XLANG Lab, University of Hong Kong
Tianbao Xie
University of Hong Kong
Artificial Intelligence, Deep Learning, Natural Language Processing
Junli Wang
Tsinghua University
Natural Language Processing
Jiaqi Deng
The University of Hong Kong
Deep Learning, Natural Language Processing
Xiaole Guo
XLANG Lab, University of Hong Kong
Yiheng Xu
University of Hong Kong
Natural Language Processing
Chen Henry Wu
PhD Student, Carnegie Mellon University
language models, generative models
Zhennan Shen
Shanghai Jiao Tong University
NLP, LLM, Robotic, AI
Zhuokai Li
XLANG Lab, University of Hong Kong
Ryan Li
XLANG Lab, University of Hong Kong
Xiaochuan Li
Carnegie Mellon University
Machine Learning, Natural Language Processing
Junda Chen
XLANG Lab, University of Hong Kong
Boyuan Zheng
XLANG Lab, University of Hong Kong
Peihang Li
XLANG Lab, University of Hong Kong
Fangyu Lei
Institute of Automation, Chinese Academy of Sciences
LLM-Agent, Code Generation, Text-to-SQL, Table Reasoning
Ruisheng Cao
Shanghai Jiao Tong University
LLM Agent, text-to-SQL, code generation, semantic parsing, dialogue systems
Yeqiao Fu
XLANG Lab, University of Hong Kong
Dongchan Shin
Moonshot AI
Martin Shin
Moonshot AI
Jiarui Hu
Zhejiang University
Computer Vision, Robotics, Computer Graphics
Yuyan Wang
Assistant Professor of Marketing, Stanford Graduate School of Business
Machine Learning, Recommender Systems and Personalization, Long-Term Value Optimization, Algorithmic
Jixuan Chen
UC San Diego
Multimodal agents, Natural language processing, Machine learning