Pose-Aware Weakly-Supervised Action Segmentation

2025-04-08
Citations: 0
Influential: 0
AI Summary
This paper addresses weakly supervised action segmentation in long instructional videos, requiring only video-level labels. The proposed pose-aware method leverages pose priors during training but needs no pose input at inference. Its core contributions are threefold: (1) a pose-aware contrastive loss that distills action-relevant pose knowledge using trainable pose priors; (2) complete decoupling from pose dependencies at inference time, which keeps deployment practical; and (3) a boundary-aware modeling module integrated within a multi-backbone-compatible architecture, supporting both online and offline deployment. Evaluated on multiple benchmarks, the method surpasses state-of-the-art approaches in both overall segmentation accuracy and action boundary localization, and it generalizes across diverse instructional video domains.

Abstract
Understanding human behavior is an important problem in the pursuit of visual intelligence. A challenge in this endeavor is the extensive and costly effort required to accurately label action segments. To address this issue, we consider learning methods that demand minimal supervision for segmentation of human actions in long instructional videos. Specifically, we introduce a weakly-supervised framework that uniquely incorporates pose knowledge during training while omitting its use during inference, thereby distilling pose knowledge pertinent to each action component. We propose a pose-inspired contrastive loss, as part of the overall weakly-supervised framework, that is trained to distinguish action boundaries more effectively. Our approach, validated through extensive experiments on representative datasets, outperforms previous state-of-the-art (SOTA) methods in segmenting long instructional videos under both online and offline settings. Additionally, we demonstrate the framework's adaptability to various segmentation backbones and pose extractors across different datasets.

Problem

Research questions and friction points this paper is trying to address.

Reducing costly action segment labeling in videos
Weakly-supervised human action segmentation using pose knowledge
Improving action boundary detection in instructional videos

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pose-aware weakly-supervised action segmentation framework
Pose-inspired contrastive loss for boundary distinction
Adaptable to various backbones and pose extractors
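
To make the idea of a pose-inspired contrastive loss concrete, the sketch below shows one generic way such a loss could look. It is not the paper's formulation, which this summary does not specify. It is a standard cross-modal InfoNCE loss in NumPy, under the assumption that each frame's video feature is pulled toward the pose feature of the same frame and pushed away from the pose features of other frames. The names `info_nce`, `frame_feats`, `pose_feats`, and `temperature` are hypothetical. Because pose appears only inside the loss, a model trained this way would need just the video features at inference.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(frame_feats, pose_feats, temperature=0.1):
    """Generic cross-modal InfoNCE loss (an illustrative assumption, not the paper's loss).

    frame_feats: (T, D) video features, one row per frame.
    pose_feats:  (T, D) pose features for the same T frames (training only).
    For frame t the positive is pose t; the poses of all other frames are negatives.
    """
    f = l2_normalize(frame_feats)
    p = l2_normalize(pose_feats)
    logits = f @ p.T / temperature                       # (T, T) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # cross-entropy on the diagonal


# Toy data: 8 frames with 16-dimensional features (synthetic, for illustration only).
rng = np.random.default_rng(0)
pose = rng.normal(size=(8, 16))
aligned = pose + 0.01 * rng.normal(size=(8, 16))  # video features that agree with pose
mismatched = aligned[::-1]                        # same features in reversed frame order

loss_aligned = info_nce(aligned, pose)
loss_mismatched = info_nce(mismatched, pose)
```

On this toy data the aligned pairing yields a lower loss than the reversed pairing, which is the behavior a contrastive objective is meant to reward. In real video, neighboring frames within one action are near-duplicates, so treating every other frame as a negative is a simplification; a practical variant would have to handle such false negatives.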