CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of high-quality, continuous human-computer interaction videos that has constrained research on computer-use agents. To overcome the limitations of existing datasets, which consist primarily of sparse screenshots lacking dynamic interaction context, we introduce CUA-Suite, a comprehensive ecosystem comprising 55 hours (6 million frames at 30 fps) of continuous screen recordings, precise cursor trajectories, and multi-level reasoning annotations across 10,000 tasks in 87 professional software applications. We also release the UI-Vision benchmark and the GroundCUA UI grounding dataset. As the first large-scale resource of continuous human demonstration videos for desktop interaction, CUA-Suite preserves full temporal dynamics and supports lossless conversion into diverse agent input formats, enabling advances in screen understanding, sequential control, and video-based reward modeling. Preliminary evaluation reveals an approximately 60% failure rate for current foundation action models on professional desktop tasks, underscoring the dataset’s difficulty and utility.
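
The reported scale is internally consistent; a quick arithmetic check using only the figures stated above:

```latex
55\,\text{h} \times 3600\,\tfrac{\text{s}}{\text{h}} \times 30\,\tfrac{\text{frames}}{\text{s}} = 5{,}940{,}000 \approx 6 \times 10^{6}\ \text{frames}
```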

📝 Abstract
Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. CUA-Suite further provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate). Beyond evaluation, CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models are publicly released.
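
To illustrate the "superset" claim, the sketch below shows one direction of the lossless conversion: projecting a continuous 30 fps recording with a dense cursor trace down to the sparse screenshot-plus-click format many agent frameworks consume. This is a minimal illustration under assumed structure; the record types, field names, and `to_sparse_steps` helper are hypothetical and do not reflect CUA-Suite's actual schema or API.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical record types; field names are illustrative, not CUA-Suite's schema.
@dataclass
class CursorEvent:
    t: float      # timestamp in seconds from recording start
    x: int        # cursor x position in pixels
    y: int        # cursor y position in pixels
    action: str   # e.g. "move", "click", "drag"

@dataclass
class SparseStep:
    frame_index: int             # index into the 30 fps frame sequence
    click_xy: Tuple[int, int]    # final click coordinate, as in sparse datasets

def to_sparse_steps(events: List[CursorEvent], fps: int = 30) -> List[SparseStep]:
    """Project a continuous cursor trace onto sparse screenshot+click pairs.

    Because the continuous video and trace contain strictly more information,
    this projection recovers everything the sparse format could represent:
    each click maps to the frame visible at its timestamp plus its coordinate.
    """
    return [
        SparseStep(frame_index=round(ev.t * fps), click_xy=(ev.x, ev.y))
        for ev in events
        if ev.action == "click"
    ]

# Example: clicks at t=1.2 s and t=3.5 s map to frames 36 and 105.
demo = [
    CursorEvent(1.2, 640, 360, "click"),
    CursorEvent(2.0, 700, 400, "move"),
    CursorEvent(3.5, 128, 64, "click"),
]
print(to_sparse_steps(demo))
```

The reverse direction is impossible: sparse click coordinates cannot reconstruct the intervening motion, hesitation, or intermediate screen states, which is why the continuous recordings are the more general training signal.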
Problem

Research questions and friction points this paper is trying to address.

computer-use agents
human demonstration videos
continuous video data
desktop automation
agent training data scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

continuous video demonstrations
computer-use agents
dense multimodal annotations
UI grounding
desktop automation benchmark