SeeAction: Towards Reverse Engineering How-What-Where of HCI Actions from Screencasts for UI Automation

📅 2025-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of non-intrusively reverse-engineering user UI interactions from screencast videos to enable automated script generation. The proposed method introduces the first end-to-end screencast-to-structured-action parsing framework: a multi-task joint-learning model that simultaneously classifies 11 interaction command types, classifies 11 UI widget types, and generates natural-language spatial descriptions, yielding "command–control–location" triples. Technically, it integrates action-oriented video understanding, structured semantic parsing, and multimodal visual modeling. Evaluated on 7,260 real-world video–action pairs recorded across five diverse applications (Microsoft Word, Zoom, Firefox, Photoshop, and Windows 10 Settings), the model proves effective and general, and a screencast-to-action-script tool built on it demonstrates usefulness for UI-level bug reproduction. The tool is open-sourced. Key contributions include: (1) the first end-to-end paradigm for structured UI action parsing; (2) a novel multi-task collaborative modeling mechanism; and (3) a lightweight video understanding architecture that generalizes across applications.
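To make the "command–control–location" triple concrete, here is a minimal illustrative sketch of what one recognized action might look like as a structured record and how it could be rendered into a human-readable script step. The class and field names are hypothetical, not the authors' API; only the triple structure itself comes from the paper.

```python
from dataclasses import dataclass

# Hypothetical representation of one recognized UI action.
# The paper's model outputs a command type, a widget type, and a
# natural-language location phrase; the names below are illustrative.
@dataclass
class UIAction:
    command: str   # one of the 11 command types, e.g. "click"
    widget: str    # one of the 11 widget types, e.g. "button"
    location: str  # natural-language spatial phrase

    def to_script_line(self) -> str:
        # Render the triple as a readable automation step.
        return f"{self.command} the {self.widget} {self.location}"

action = UIAction("click", "button", "at the top right of the dialog")
print(action.to_script_line())
# → click the button at the top right of the dialog
```

A downstream screencast-to-script tool would emit one such step per detected action, which is what makes the recognized actions directly replayable or reviewable by a human.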

📝 Abstract
UI automation is a useful technique for UI testing, bug reproduction, and robotic process automation. Recording user actions with an application assists rapid development of UI automation scripts, but existing recording techniques are intrusive, rely on OS or GUI framework accessibility support, or assume specific app implementations. Reverse engineering user actions from screencasts is non-intrusive, but a key reverse-engineering step is currently missing - recognizing human-understandable structured user actions ([command] [widget] [location]) from action screencasts. To fill the gap, we propose a deep learning-based computer vision model that can recognize 11 commands and 11 widgets, and generate location phrases from action screencasts, through joint learning and multi-task learning. We label a large dataset with 7260 video-action pairs, which record user interactions with Word, Zoom, Firefox, Photoshop, and Windows 10 Settings. Through extensive experiments, we confirm the effectiveness and generality of our model, and demonstrate the usefulness of a screencast-to-action-script tool built upon our model for bug reproduction.
Problem

Research questions and friction points this paper is trying to address.

Reverse engineering user actions from screencasts for UI automation
Recognizing structured user actions from action screencasts
Developing a deep learning model for command, widget, and location recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep learning-based computer vision model
Recognizes commands, widgets, locations
Generates UI automation scripts from screencasts