ShowUI-Aloha: Human-Taught GUI Agent

πŸ“… 2026-01-12
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the challenge of automating complex GUI tasks in real-world desktop environments, where high-quality training data is scarce and learning effectively from unstructured human demonstration videos remains difficult. We propose the first end-to-end framework that automatically transforms unlabeled screen recordings into structured, executable tasks through four synergistic modules: recording, understanding, planning, and execution. Our approach integrates multimodal interaction capture (including mouse, keyboard, and scrolling), vision-language semantic parsing, and context-aware task planning, enabling safe, operating system–level execution of action sequences. This framework establishes the first complete mapping from raw human demonstrations to semantically meaningful task descriptions and executable behaviors, significantly enhancing the generalization and execution capabilities of GUI agents in authentic settings.
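To illustrate the kind of multimodal interaction capture the summary describes, here is a minimal sketch (all names and structures are hypothetical, invented for illustration, not the paper's actual implementation) of normalizing a raw stream of mouse, keyboard, and scroll events into structured action records that downstream parsing could consume:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class RawEvent:
    kind: str         # "click", "key", or "scroll"
    timestamp: float  # seconds since recording start
    data: dict[str, Any]

def parse_events(events: list[RawEvent]) -> list[dict[str, Any]]:
    """Collapse a raw event stream into structured actions:
    consecutive key presses merge into one 'type' action, while
    clicks and scrolls pass through as individual actions."""
    actions: list[dict[str, Any]] = []
    buffer: list[RawEvent] = []  # pending keystrokes awaiting merge

    def flush() -> None:
        if buffer:
            actions.append({
                "action": "type",
                "text": "".join(e.data["char"] for e in buffer),
                "start": buffer[0].timestamp,
            })
            buffer.clear()

    for ev in events:
        if ev.kind == "key":
            buffer.append(ev)
        else:
            flush()
            actions.append({"action": ev.kind, **ev.data, "start": ev.timestamp})
    flush()
    return actions
```

A recording of a click, two keystrokes, and a scroll would thus reduce to three structured actions, the two keystrokes becoming a single typing action.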

πŸ“ Abstract
Graphical User Interfaces (GUIs) are central to human-computer interaction, yet automating complex GUI tasks remains a major challenge for autonomous agents, largely due to a lack of scalable, high-quality training data. While recordings of human demonstrations offer a rich data source, they are typically long, unstructured, and unannotated, making them difficult for agents to learn from. To address this, we introduce ShowUI-Aloha, a comprehensive pipeline that transforms unstructured, in-the-wild human screen recordings from desktop environments into structured, actionable tasks. Our framework comprises four key components: a recorder that captures screen video along with precise user interactions such as mouse clicks, keystrokes, and scrolls; a learner that semantically interprets these raw interactions and the surrounding visual context, translating them into descriptive natural-language captions; a planner that reads the parsed demonstrations, maintains task state, and dynamically formulates the next high-level action plan through contextual reasoning; and an executor that faithfully carries out these action plans at the OS level, performing precise clicks, drags, text inputs, and window operations with safety checks and real-time feedback. Together, these components provide a scalable solution for collecting and parsing real-world human data, demonstrating a viable path toward building general-purpose GUI agents that learn effectively simply by observing humans.
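To make the four-stage design concrete, the recorder-learner-planner-executor chain could be sketched as follows. All interfaces here are hypothetical, invented for illustration under the abstract's description; they are not the paper's actual API:

```python
from typing import Protocol

class Recorder(Protocol):
    def record(self) -> list[dict]: ...  # raw interaction events

class Learner(Protocol):
    def caption(self, events: list[dict]) -> list[str]: ...  # NL descriptions

class Planner(Protocol):
    def plan(self, captions: list[str], state: dict) -> str: ...  # next step

class Executor(Protocol):
    def execute(self, step: str) -> bool: ...  # True if the OS action succeeded

def run_pipeline(rec: Recorder, lrn: Learner, pln: Planner, exe: Executor) -> dict:
    """Demonstration-to-execution loop: capture events, caption them,
    then plan and execute one step at a time, halting on failure as a
    stand-in for the safety checks the abstract mentions."""
    state: dict = {"done": [], "failed": None}
    captions = lrn.caption(rec.record())
    for _ in captions:
        step = pln.plan(captions, state)
        if not exe.execute(step):  # stop rather than continue after a failed action
            state["failed"] = step
            break
        state["done"].append(step)
    return state
```

Because each stage is a narrow protocol, any one module (e.g. the planner) can be swapped for a stronger model without touching the others, which matches the modular framing in the abstract.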
Problem

Research questions and friction points this paper is trying to address.

GUI automation
human demonstration
training data
unstructured screen recordings
autonomous agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

GUI agent
human demonstration
structured task learning
multimodal interaction parsing
OS-level execution
Yichun Zhang
Show Lab, National University of Singapore
Xiangwu Guo
Show Lab, National University of Singapore
Yauhong Goh
Show Lab, National University of Singapore
Jessica Hu
Show Lab, National University of Singapore
Zhiheng Chen
University of California, Irvine
Xin Wang
Show Lab, National University of Singapore
Difei Gao
National University of Singapore; Institute of Computing Technology, Chinese Academy of Sciences
Artificial Intelligence · AI Agent · Vision and Language
Mike Zheng Shou
Show Lab, National University of Singapore