🤖 AI Summary
Temporal Action Localization (TAL) suffers from high annotation costs, and existing data programming approaches struggle to model spatiotemporal dynamics in videos. Method: We propose a drag-and-link video programming framework that enables users to visually specify complex action logic by dragging body-part or object nodes and establishing spatiotemporal links between them. The framework introduces the first symbolic event modeling and rule-driven label generation paradigm tailored to TAL, integrating human pose estimation, object detection, and graph-structured relational constraints to let non-experts construct weakly supervised labels efficiently and at scale. Contribution/Results: Our method achieves near fully supervised performance on standard TAL benchmarks. A user study confirms a substantial reduction in annotation effort and improved modeling efficiency. The framework establishes a scalable, accessible paradigm for domain-customized action recognition systems.
📝 Abstract
Temporal Action Localization (TAL) aims to detect the start and end timestamps of actions in a video. However, training TAL models requires a substantial amount of manually annotated data. Data programming is an efficient way to create training labels from a set of human-defined labeling functions, but applying it to TAL is difficult because complex actions are hard to define over temporal sequences of video frames. In this paper, we propose ProTAL, a drag-and-link video programming framework for TAL. ProTAL enables users to define key events by dragging nodes representing body parts and objects and linking them to constrain their relations (direction, distance, etc.). These definitions are used to generate action labels for large-scale unlabelled videos, and a semi-supervised method is then employed to train TAL models on such labels. We demonstrate the effectiveness of ProTAL through a usage scenario and a user study, providing insights into designing video programming frameworks.
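The key-event definitions described above can be viewed as rule-based labeling functions over per-frame node positions. The following is a minimal sketch of that idea, assuming pose estimation and object detection already yield normalized (x, y) positions per node; all node names, thresholds, and function names here are illustrative assumptions, not the paper's actual interface.

```python
import math

# Illustrative relational constraints between two nodes (positions in [0, 1]).
def near(a, b, max_dist):
    """Distance constraint: nodes a and b lie within max_dist of each other."""
    return math.dist(a, b) <= max_dist

def above(a, b):
    """Direction constraint: node a is above node b (smaller y in image coords)."""
    return a[1] < b[1]

def drink_event(frame):
    """Hypothetical key event: hand holds a cup near the mouth, above the chest."""
    hand, cup, mouth, chest = (frame[k] for k in ("hand", "cup", "mouth", "chest"))
    return near(hand, cup, 0.05) and near(cup, mouth, 0.10) and above(cup, chest)

def localize(frames, event_fn, min_len=3):
    """Turn per-frame event hits into (start, end) action-interval labels."""
    intervals, start = [], None
    for t, frame in enumerate(frames):
        if event_fn(frame):
            if start is None:
                start = t
        elif start is not None:
            if t - start >= min_len:
                intervals.append((start, t - 1))
            start = None
    if start is not None and len(frames) - start >= min_len:
        intervals.append((start, len(frames) - 1))
    return intervals
```

Intervals produced this way would serve only as weak labels; the paper's semi-supervised training step is what turns them into a usable TAL model.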