Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization

📅 2025-12-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of precise action boundary detection in skeleton-based temporal action localization, this paper proposes a snippet-discriminative self-supervised pre-training paradigm. Specifically, it introduces a non-overlapping skeleton snippet-level contrastive learning task to explicitly model fine-grained temporal dynamics between adjacent frames, and incorporates a U-shaped multi-scale feature fusion module that integrates temporally sensitive intermediate-layer features during decoding to enhance frame-level localization resolution. The method requires no human annotations and pioneers contrastive pre-training grounded in skeleton snippets, enabling end-to-end joint temporal-spatial modeling. It significantly outperforms existing skeleton-based contrastive approaches across all BABEL subsets and evaluation protocols. Furthermore, when pre-trained on NTU RGB+D and BABEL, the model achieves state-of-the-art transfer performance for temporal action localization on PKUMMD.

📝 Abstract
The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level action recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning. Additionally, we build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.
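The snippet discrimination pretext task described above splits each skeleton sequence into non-overlapping snippets and contrasts matched snippet pairs against all others in the batch. A minimal NumPy sketch of that idea, assuming an InfoNCE-style objective over snippet embeddings (function names, the snippet splitter, and the exact loss form are illustrative, not the paper's implementation):

```python
import numpy as np

def split_into_snippets(seq, snippet_len):
    """Split a (T, J, C) skeleton sequence into non-overlapping
    (T // snippet_len, snippet_len, J, C) snippets, dropping the tail."""
    t = seq.shape[0] - seq.shape[0] % snippet_len
    return seq[:t].reshape(-1, snippet_len, *seq.shape[1:])

def snippet_infonce(feats_a, feats_b, temperature=0.07):
    """InfoNCE over snippet embeddings from two augmented views.

    feats_a, feats_b: (N, D) arrays; row i of each view embeds the same
    snippet (positive pair), and all other rows act as negatives."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                   # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(a.shape[0])
    return -log_prob[idx, idx].mean()                # positives on the diagonal
```

Because snippets from the same video also appear as negatives, the loss pushes the encoder to separate temporally adjacent segments, which is the temporal sensitivity that boundary detection needs.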
Problem

Research questions and friction points this paper is trying to address.

Develops self-supervised pretraining for skeleton-based action localization
Enhances temporal sensitivity to detect precise action boundaries
Improves feature resolution for frame-level localization via multiscale fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Snippet discrimination pretext task for self-supervised pretraining
U-shaped module fuses intermediate features to enhance frame-level feature resolution
Contrastive learning distinguishes skeleton segments across videos
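The U-shaped fusion contribution above restores frame-level temporal resolution by upsampling deep backbone features and merging them with temporally finer intermediate features. A minimal sketch of one decoding pass, assuming nearest-neighbor temporal upsampling and channel concatenation (the paper's exact fusion operator and channel widths are not specified here; these are illustrative):

```python
import numpy as np

def upsample_time(x, factor):
    """Nearest-neighbor temporal upsampling: (T, D) -> (T * factor, D)."""
    return np.repeat(x, factor, axis=0)

def u_fuse(deep, skip):
    """Fuse coarse deep features with a finer encoder skip connection:
    upsample `deep` to the skip's temporal length, then concatenate channels."""
    up = upsample_time(deep, skip.shape[0] // deep.shape[0])
    return np.concatenate([up, skip], axis=1)

# Toy feature pyramid for a 64-frame sequence (channel sizes are made up):
deep  = np.zeros((16, 256))   # 1/4 temporal resolution
skip2 = np.zeros((32, 128))   # 1/2 resolution intermediate features
skip1 = np.zeros((64, 64))    # full-resolution intermediate features
x = u_fuse(deep, skip2)       # (32, 384): 256 + 128 channels
x = u_fuse(x, skip1)          # (64, 448): back to one feature per frame
```

Each fusion step doubles the temporal length while injecting the encoder's temporally sensitive intermediate features, so the final output assigns a feature vector to every frame for localization.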
Qiushuo Cheng
School of Computer Science, University of Bristol
Jingjing Liu
School of Computer Science, University of Bristol
Catherine Morgan
North Bristol NHS Trust
Alan Whone
Unknown affiliation
Majid Mirmehdi
Professor of Computer Vision, FIAPR, FBMVA, University of Bristol
Computer Vision and Pattern Recognition