🤖 AI Summary
This work addresses the limitations of existing self-supervised skeleton-based action recognition methods, which suffer from incomplete representations and restricted generalization due to biased masking strategies. To overcome this, we propose an asymmetric spatiotemporal masking mechanism that leverages complementary masking of high/low-motion frames and high/low-activity joints, coupled with a learnable feature alignment module to capture more comprehensive action representations. Furthermore, we integrate knowledge distillation to significantly enhance deployment efficiency without compromising performance. Extensive experiments on benchmarks such as NTU RGB+D demonstrate that our approach improves fine-tuning accuracy by 2.7–4.4% and boosts transfer learning performance by up to 5.9%. The distilled model achieves a 91.4% reduction in parameters and a threefold increase in inference speed.
📝 Abstract
Self-supervised learning (SSL) has shown remarkable success in skeleton-based action recognition by leveraging data augmentations to learn meaningful representations. However, existing SSL methods rely on data augmentations that predominantly focus on masking high-motion frames and high-degree joints (e.g., joints with degree 3 or 4). This results in biased and incomplete feature representations that struggle to generalize across varied motion patterns. To address this, we propose Asymmetric Spatio-temporal Masking (ASMa) for Skeleton Action Representation Learning, a novel combination of masking strategies that learns the full spectrum of spatio-temporal dynamics inherent in human actions. ASMa employs two complementary masking strategies: one that masks high-degree joints and low-motion frames, and another that masks low-degree joints and high-motion frames. Together, these strategies ensure more balanced and comprehensive skeleton representation learning. Furthermore, we introduce a learnable feature alignment module to effectively align the representations learned from the two masked views. To facilitate deployment in resource-constrained settings and on low-resource devices, we compress the learned, aligned representation into a lightweight model using knowledge distillation. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our approach outperforms existing SSL methods, with an average improvement of 2.7–4.4% in fine-tuning and up to 5.9% in transfer learning to noisy datasets, and achieves competitive performance compared to fully supervised baselines. Our distilled model achieves a 91.4% parameter reduction and 3× faster inference on edge devices while maintaining competitive accuracy, enabling practical deployment in resource-constrained scenarios.
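The two complementary masking strategies described above can be sketched as follows. This is an illustrative NumPy toy, not the authors' implementation: the function name `complementary_masks`, the motion measure (mean inter-frame displacement), and the fixed mask ratio are all assumptions for demonstration.

```python
import numpy as np

def complementary_masks(seq, adjacency, mask_ratio=0.5):
    """Illustrative sketch of ASMa-style complementary masking (assumed design,
    not the paper's code). seq: (T, J, C) skeleton sequence; adjacency: (J, J)."""
    T, J, C = seq.shape
    # Per-frame motion score: mean joint displacement between consecutive frames.
    motion = np.zeros(T)
    motion[1:] = np.linalg.norm(seq[1:] - seq[:-1], axis=-1).mean(axis=-1)
    # Per-joint degree from the skeleton graph's adjacency matrix.
    degree = adjacency.sum(axis=0)
    k_t, k_j = int(T * mask_ratio), int(J * mask_ratio)
    # View A: mask high-degree joints and low-motion frames.
    view_a = seq.copy()
    view_a[np.argsort(motion)[:k_t]] = 0.0      # lowest-motion frames
    view_a[:, np.argsort(degree)[-k_j:]] = 0.0  # highest-degree joints
    # View B: mask low-degree joints and high-motion frames.
    view_b = seq.copy()
    view_b[np.argsort(motion)[-k_t:]] = 0.0     # highest-motion frames
    view_b[:, np.argsort(degree)[:k_j]] = 0.0   # lowest-degree joints
    return view_a, view_b
```

Because each view hides what the other exposes, an encoder trained on both must model low- and high-motion dynamics as well as central and peripheral joints, which is the balance the abstract argues biased single-strategy masking fails to achieve.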