CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing

📅 2026-05-19
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the lack of systematic evaluation for GUI-based intelligent agents in professional creative workflows—particularly in complex, long-horizon, multimodal interaction tasks such as media post-production. To bridge this gap, the authors introduce the first benchmark specifically designed for media editing, encompassing 186 real-world tasks across seven professional software applications. They also propose a lightweight parser that converts screen recordings and low-level interaction logs into structured action trajectories, enabling composable and extensible evaluation of long-horizon multimodal tasks. Experimental results reveal that current agents achieve only a 36.0% task success rate on this benchmark, highlighting significant deficiencies in their long-horizon reliability and domain-specific planning capabilities.
📝 Abstract
While GUI agents have made significant progress in web navigation and basic operating system tasks, their capabilities in professional creative workflows remain largely underexplored. To bridge this gap, we introduce CutVerse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows, involving dense multimodal interfaces and tightly coupled interaction sequences. To support scalable evaluation, we develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. Extensive evaluations reveal that existing agents achieve only 36.0% task success on realistic media editing tasks, underscoring the challenges posed by complex, long-horizon media post-production workflows in our benchmark. While current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution, they remain limited in long-horizon reliability and domain-specific planning.
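To make the parser idea concrete, the following is a minimal sketch of how low-level interaction logs might be grouped into structured actions. It is not the authors' parser. The event schema (`RawEvent`), the action schema (`Action`), the 5-pixel drag threshold, and the function name `parse_events` are all assumptions made for illustration, and the sketch ignores the screen-recording side entirely.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RawEvent:
    """Hypothetical low-level log entry (schema assumed, not from the paper)."""
    t: float          # seconds since recording start
    kind: str         # "mouse_down", "mouse_up", or "key_down"
    x: int = 0
    y: int = 0
    key: str = ""


@dataclass
class Action:
    """Hypothetical structured action in a trajectory."""
    type: str                                  # "click", "drag", or "type"
    start: float
    end: float
    point: Optional[Tuple[int, int]] = None    # where the pointer went down
    end_point: Optional[Tuple[int, int]] = None  # where a drag was released
    text: str = ""


DRAG_THRESHOLD_PX = 5  # assumed cutoff between a click and a drag


def parse_events(events: List[RawEvent]) -> List[Action]:
    """Group raw events into clicks, drags, and typed-text actions."""
    actions: List[Action] = []
    down: Optional[RawEvent] = None
    for ev in sorted(events, key=lambda e: e.t):
        if ev.kind == "mouse_down":
            down = ev
        elif ev.kind == "mouse_up" and down is not None:
            # Manhattan distance between press and release positions.
            moved = abs(ev.x - down.x) + abs(ev.y - down.y)
            if moved > DRAG_THRESHOLD_PX:
                actions.append(Action("drag", down.t, ev.t,
                                      (down.x, down.y), (ev.x, ev.y)))
            else:
                actions.append(Action("click", down.t, ev.t, (down.x, down.y)))
            down = None
        elif ev.kind == "key_down":
            # Merge consecutive key presses into a single "type" action.
            last = actions[-1] if actions else None
            if last is not None and last.type == "type":
                last.text += ev.key
                last.end = ev.t
            else:
                actions.append(Action("type", ev.t, ev.t, text=ev.key))
    return actions


sample = [
    RawEvent(0.0, "mouse_down", 10, 10),
    RawEvent(0.1, "mouse_up", 11, 10),
    RawEvent(0.5, "key_down", key="a"),
    RawEvent(0.6, "key_down", key="b"),
    RawEvent(1.0, "mouse_down", 100, 100),
    RawEvent(1.5, "mouse_up", 300, 100),
]
trajectory = parse_events(sample)
```

A real pipeline would presumably also align each action with video frames to ground it in a UI element; that step is left out here because the listing does not describe how it is done.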
Problem

Research questions and friction points this paper is trying to address.

GUI agents
media post-production
benchmark
long-horizon tasks
creative workflows
Innovation

Methods, ideas, or system contributions that make the work stand out.

GUI agents
media post-production
compositional benchmark
multimodal interaction
long-horizon tasks