🤖 AI Summary
Temporal Action Localization (TAL) suffers from high annotation costs, and existing data programming approaches struggle to model spatiotemporal dynamics in videos. Method: We propose a drag-and-link video programming framework that enables users to visually specify complex action logic by dragging body-part or object nodes and establishing spatiotemporal links between them. The framework introduces the first symbolic event modeling and rule-driven label generation paradigm tailored to TAL, integrating human pose estimation, object detection, and graph-structured relational constraints to let non-experts construct weakly supervised labels efficiently and at scale. Contribution/Results: Our method achieves near fully supervised performance on standard TAL benchmarks. A user study confirms a substantial reduction in annotation effort and improved modeling efficiency. The framework establishes a scalable, accessible paradigm for domain-customized action recognition systems.
📝 Abstract
Temporal Action Localization (TAL) aims to detect the start and end timestamps of actions in a video. However, training TAL models requires a substantial amount of manually annotated data. Data programming is an efficient way to create training labels from a set of human-defined labeling functions, but applying it to TAL is difficult because complex actions are hard to define over temporal sequences of video frames. In this paper, we propose ProTAL, a drag-and-link video programming framework for TAL. ProTAL enables users to define key events by dragging nodes representing body parts and objects and linking them to constrain their relations (direction, distance, etc.). These definitions are used to generate action labels for large-scale unlabelled videos, and a semi-supervised method is then employed to train TAL models on such labels. We demonstrate the effectiveness of ProTAL through a usage scenario and a user study, providing insights into designing video programming frameworks.
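The key-event definitions described above can be viewed as rule-based labeling functions over per-frame node positions. The following is a minimal sketch of that idea, assuming pose estimation and object detection already yield normalized (x, y) positions per node; all node names, thresholds, and function names here are illustrative assumptions, not the paper's actual interface.

```python
import math

# Illustrative relational constraints between two nodes (positions in [0, 1]).
def near(a, b, max_dist):
    """Distance constraint: nodes a and b lie within max_dist of each other."""
    return math.dist(a, b) <= max_dist

def above(a, b):
    """Direction constraint: node a is above node b (smaller y in image coords)."""
    return a[1] < b[1]

def drink_event(frame):
    """Hypothetical key event: hand holds a cup near the mouth, above the chest."""
    hand, cup, mouth, chest = (frame[k] for k in ("hand", "cup", "mouth", "chest"))
    return near(hand, cup, 0.05) and near(cup, mouth, 0.10) and above(cup, chest)

def localize(frames, event_fn, min_len=3):
    """Turn per-frame event hits into (start, end) action-interval labels."""
    intervals, start = [], None
    for t, frame in enumerate(frames):
        if event_fn(frame):
            if start is None:
                start = t
        elif start is not None:
            if t - start >= min_len:
                intervals.append((start, t - 1))
            start = None
    if start is not None and len(frames) - start >= min_len:
        intervals.append((start, len(frames) - 1))
    return intervals
```

Intervals produced this way would serve only as weak labels; the paper's semi-supervised training step is what turns them into a usable TAL model.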