ProTAL: A Drag-and-Link Video Programming Framework for Temporal Action Localization

📅 2025-04-25
🏛️ International Conference on Human Factors in Computing Systems
📈 Citations: 0
Influential: 0
🤖 AI Summary
Temporal Action Localization (TAL) suffers from high annotation costs, and existing data programming approaches struggle to model spatiotemporal dynamics in videos. Method: We propose a drag-and-link video programming framework that lets users visually specify complex action logic by dragging body-part or object nodes and establishing spatiotemporal links. The framework introduces a symbolic event-modeling and rule-driven label-generation paradigm tailored for TAL, integrating human pose estimation, object detection, and graph-structured relational constraints so that non-experts can efficiently construct weakly supervised labels at scale. Contribution/Results: The generated labels, combined with semi-supervised training, support effective TAL model training. A usage scenario and a user study confirm a substantial reduction in annotation effort and improved modeling efficiency. The framework establishes a scalable, accessible paradigm for domain-customized action recognition systems.

📝 Abstract
Temporal Action Localization (TAL) aims to detect the start and end timestamps of actions in a video. However, the training of TAL models requires a substantial amount of manually annotated data. Data programming is an efficient method to create training labels with a series of human-defined labeling functions. However, its application in TAL faces difficulties in defining complex actions in the context of temporal video frames. In this paper, we propose ProTAL, a drag-and-link video programming framework for TAL. ProTAL enables users to define key events by dragging nodes representing body parts and objects and linking them to constrain their relations (direction, distance, etc.). These definitions are used to generate action labels for large-scale unlabelled videos. A semi-supervised method is then employed to train TAL models with such labels. We demonstrate the effectiveness of ProTAL through a usage scenario and a user study, providing insights into designing video programming frameworks.
Problem

Research questions and friction points this paper is trying to address.

Reducing manual annotation for Temporal Action Localization models
Simplifying complex action definition in video frames
Generating action labels for unlabelled videos efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Drag-and-link interface for defining key events
Generates action labels from user-defined relations
Semi-supervised training for Temporal Action Localization
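The drag-and-link idea above (body-part/object nodes, with links constraining distance and direction) can be sketched as a frame-level labeling function that emits action intervals. All names here (`Link`, `localize`, the coordinate convention) are illustrative assumptions, not the paper's actual implementation:

```python
import math
from dataclasses import dataclass

@dataclass
class Link:
    a: str            # source node, e.g. "right_hand"
    b: str            # target node, e.g. "ball"
    max_dist: float   # distance constraint (pixels)
    direction: str    # "above", "below", or "any"

def link_holds(link, nodes):
    """Check one spatial link on a single frame's node positions.

    Assumes image coordinates, so a smaller y means higher in the frame.
    """
    (ax, ay), (bx, by) = nodes[link.a], nodes[link.b]
    if math.hypot(ax - bx, ay - by) > link.max_dist:
        return False
    if link.direction == "above" and not ay < by:
        return False
    if link.direction == "below" and not ay > by:
        return False
    return True

def localize(frames, links):
    """Return (start, end) frame intervals where every link constraint holds."""
    intervals, start = [], None
    for i, nodes in enumerate(frames):
        ok = all(link_holds(l, nodes) for l in links)
        if ok and start is None:
            start = i                        # interval opens
        elif not ok and start is not None:
            intervals.append((start, i - 1)) # interval closes
            start = None
    if start is not None:
        intervals.append((start, len(frames) - 1))
    return intervals

# Hypothetical per-frame keypoint/detection positions:
frames = [
    {"right_hand": (0, 0),   "ball": (100, 100)},  # too far apart
    {"right_hand": (50, 50), "ball": (60, 55)},    # close enough
    {"right_hand": (52, 50), "ball": (60, 55)},    # still close
    {"right_hand": (0, 0),   "ball": (200, 0)},    # far again
]
print(localize(frames, [Link("right_hand", "ball", 30.0, "any")]))  # → [(1, 2)]
```

In ProTAL's setting, intervals produced by such rules would serve as weak action labels over unlabelled videos, which the semi-supervised stage then uses for training.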
Yuchen He
Shanghai Jiaotong University
online learning
Jianbing Lv
School of Software Technology, Zhejiang University, Hangzhou, Zhejiang, China
Liqi Cheng
State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang, China
Lingyu Meng
State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang, China
Dazhen Deng
Zhejiang University
Visual Analytics, XAI, Human-Computer Interaction
Yingcai Wu
Professor at the State Key Lab of CAD&CG, Zhejiang University
Visual Analytics, Sports Analytics, Urban Computing