🤖 AI Summary
This work addresses the limitations of existing self-supervised skeleton-based action recognition methods, which suffer from incomplete representations and restricted generalization due to biased masking strategies. To overcome this, we propose an asymmetric spatiotemporal masking mechanism that leverages complementary masking of high/low-motion frames and high/low-activity joints, coupled with a learnable feature alignment module to capture more comprehensive action representations. Furthermore, we integrate knowledge distillation to significantly enhance deployment efficiency without compromising performance. Extensive experiments on benchmarks such as NTU RGB+D demonstrate that our approach improves fine-tuning accuracy by 2.7–4.4% and boosts transfer learning performance by up to 5.9%. The distilled model achieves a 91.4% reduction in parameters and a threefold increase in inference speed.
📝 Abstract
Self-supervised learning (SSL) has shown remarkable success in skeleton-based action recognition by leveraging data augmentations to learn meaningful representations. However, existing SSL methods rely on data augmentations that predominantly focus on masking high-motion frames and high-degree joints (e.g., joints with degree 3 or 4). This results in biased and incomplete feature representations that struggle to generalize across varied motion patterns. To address this, we propose Asymmetric Spatio-temporal Masking (ASMa) for Skeleton Action Representation Learning, a novel combination of masking strategies that learns the full spectrum of spatio-temporal dynamics inherent in human actions. ASMa employs two complementary masking strategies: one that masks high-degree joints and low-motion frames, and another that masks low-degree joints and high-motion frames. Together, these strategies ensure more balanced and comprehensive skeleton representation learning. Furthermore, we introduce a learnable feature alignment module to effectively align the representations learned from the two masked views. To facilitate deployment in resource-constrained settings and on low-resource devices, we compress the learned, aligned representation into a lightweight model using knowledge distillation. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our approach outperforms existing SSL methods, with an average improvement of 2.7–4.4% in fine-tuning and up to 5.9% in transfer learning to noisy datasets, and achieves competitive performance compared to fully supervised baselines. Our distilled model achieves a 91.4% parameter reduction and 3× faster inference on edge devices while maintaining competitive accuracy, enabling practical deployment in resource-constrained scenarios.
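The two complementary masking strategies described above can be sketched as follows. This is an illustrative NumPy toy, not the authors' implementation: the function name `complementary_masks`, the motion measure (mean inter-frame displacement), and the fixed mask ratio are all assumptions for demonstration.

```python
import numpy as np

def complementary_masks(seq, adjacency, mask_ratio=0.5):
    """Illustrative sketch of ASMa-style complementary masking (assumed design,
    not the paper's code). seq: (T, J, C) skeleton sequence; adjacency: (J, J)."""
    T, J, C = seq.shape
    # Per-frame motion score: mean joint displacement between consecutive frames.
    motion = np.zeros(T)
    motion[1:] = np.linalg.norm(seq[1:] - seq[:-1], axis=-1).mean(axis=-1)
    # Per-joint degree from the skeleton graph's adjacency matrix.
    degree = adjacency.sum(axis=0)
    k_t, k_j = int(T * mask_ratio), int(J * mask_ratio)
    # View A: mask high-degree joints and low-motion frames.
    view_a = seq.copy()
    view_a[np.argsort(motion)[:k_t]] = 0.0      # lowest-motion frames
    view_a[:, np.argsort(degree)[-k_j:]] = 0.0  # highest-degree joints
    # View B: mask low-degree joints and high-motion frames.
    view_b = seq.copy()
    view_b[np.argsort(motion)[-k_t:]] = 0.0     # highest-motion frames
    view_b[:, np.argsort(degree)[:k_j]] = 0.0   # lowest-degree joints
    return view_a, view_b
```

Because each view hides what the other exposes, an encoder trained on both must model low- and high-motion dynamics as well as central and peripheral joints, which is the balance the abstract argues biased single-strategy masking fails to achieve.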