UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

📅 2025-10-20
🤖 AI Summary
Current computer-use agents (CUAs) are limited to GUI primitives such as clicking, typing, and scrolling. These actions depend on accurate visual grounding and long execution chains, which leads to cascading failures and low efficiency, and they leave CUAs cut off from programmatic interfaces such as APIs, MCP servers, and tools. UltraCUA is a foundation model that addresses this with hybrid action, combining low-level GUI interactions with high-level programmatic tool calls. The approach has four components: an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; a synthetic data engine producing over 17,000 verifiable tasks; a large-scale collection of hybrid-action trajectories; and a two-stage training pipeline combining supervised fine-tuning (SFT) with online reinforcement learning. On OSWorld, the 7B and 32B UltraCUA models achieve an average 22% relative improvement over base models while being 11% faster in terms of steps. In out-of-domain evaluation on WindowsAgentArena, the model reaches a 21.7% success rate, outperforming baselines trained on Windows data.

📝 Abstract
Multimodal agents for computer use rely exclusively on primitive actions (click, type, scroll) that require accurate visual grounding and lengthy execution chains, leading to cascading failures and performance bottlenecks. While other agents leverage rich programmatic interfaces (APIs, MCP servers, tools), computer-use agents (CUAs) remain isolated from these capabilities. We present UltraCUA, a foundation model that bridges this gap through hybrid action -- seamlessly integrating GUI primitives with high-level programmatic tool calls. To achieve this, our approach comprises four key components: (1) an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; (2) a synthetic data engine producing over 17,000 verifiable tasks spanning real-world computer-use scenarios; (3) a large-scale high-quality hybrid action trajectory collection with both low-level GUI actions and high-level programmatic tool calls; and (4) a two-stage training pipeline combining supervised fine-tuning with online reinforcement learning, enabling strategic alternation between low-level and high-level actions. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA models achieve an average 22% relative improvement over base models, while being 11% faster in terms of steps. Out-of-domain evaluation on WindowsAgentArena shows our model reaches 21.7% success rate, outperforming baselines trained on Windows data. The hybrid action mechanism proves critical, reducing error propagation while maintaining execution efficiency.
Problem

Research questions and friction points this paper is trying to address.

Bridging GUI primitives with programmatic tools for computer agents
Reducing cascading failures in computer-use agents through hybrid actions
Enabling strategic alternation between low-level and high-level computer actions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid action integrates GUI primitives with programmatic tool calls
Automated pipeline scales programmatic tools from multiple sources
Two-stage training combines supervised fine-tuning with reinforcement learning
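
To make the hybrid action idea concrete, the following is a minimal sketch of an action space in which one trajectory can mix GUI primitives with programmatic tool calls. It is not the UltraCUA implementation: the names `GuiAction`, `ToolCall`, `HybridExecutor`, and the `set_cell` tool are hypothetical, and the GUI backend is replaced by a simple log so the example runs without a desktop environment.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Union

# Hypothetical set of low-level primitives, taken from the abstract's examples.
GUI_PRIMITIVES = {"click", "type", "scroll"}


@dataclass
class GuiAction:
    """Low-level GUI primitive (hypothetical representation)."""
    kind: str
    args: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ToolCall:
    """High-level programmatic tool call (hypothetical representation)."""
    name: str
    kwargs: Dict[str, Any] = field(default_factory=dict)


HybridAction = Union[GuiAction, ToolCall]


class HybridExecutor:
    """Routes each action to a registered tool or to a stand-in GUI backend."""

    def __init__(self, tools: Dict[str, Callable[..., Any]]):
        self.tools = tools
        self.gui_log: List[str] = []  # stand-in for a real GUI backend

    def step(self, action: HybridAction) -> Any:
        if isinstance(action, ToolCall):
            if action.name not in self.tools:
                raise KeyError(f"unknown tool: {action.name}")
            return self.tools[action.name](**action.kwargs)
        if action.kind not in GUI_PRIMITIVES:
            raise ValueError(f"unknown GUI primitive: {action.kind}")
        entry = f"{action.kind}({action.args})"
        self.gui_log.append(entry)
        return entry


# Hypothetical spreadsheet tool: one call replaces a click-and-type sequence.
sheet: Dict[str, Any] = {}


def set_cell(cell: str, value: Any) -> str:
    sheet[cell] = value
    return f"set {cell}"


executor = HybridExecutor({"set_cell": set_cell})
trajectory: List[HybridAction] = [
    GuiAction("click", {"x": 120, "y": 48}),
    ToolCall("set_cell", {"cell": "A1", "value": 42}),
    GuiAction("type", {"text": "done"}),
]
results = [executor.step(action) for action in trajectory]
```

In this sketch the tool call updates the sheet in a single step, whereas the equivalent GUI path would need several grounded clicks and keystrokes. That difference is the kind of shortcut the hybrid action space is meant to expose to the agent.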