🤖 AI Summary
Existing skeleton-based contrastive learning methods rely on a single skeleton representation, which limits generalization because joint definitions and anatomical coverage differ across datasets. To address this, we propose MS-CLR, the first multi-skeleton contrastive learning framework. Its explicit cross-skeleton alignment mechanism jointly models structural diversity, including varying joint counts, topologies, and anatomical coverage, thereby improving cross-dataset robustness for unsupervised action recognition. Built on an enhanced ST-GCN architecture, MS-CLR introduces a unified representation scheme and adaptive modules that support heterogeneous skeleton inputs and end-to-end contrastive learning. On NTU RGB+D 60 and 120, MS-CLR significantly outperforms strong baselines, and its multi-skeleton ensemble sets a new state of the art. These results show that explicitly modeling structural diversity is critical for improving representation generalization in skeleton-based action understanding.
📝 Abstract
Contrastive learning has gained significant attention in skeleton-based action recognition for its ability to learn robust representations from unlabeled data. However, existing methods rely on a single skeleton convention, which limits their ability to generalize across datasets with diverse joint structures and anatomical coverage. We propose Multi-Skeleton Contrastive Learning (MS-CLR), a general self-supervised framework that aligns pose representations across multiple skeleton conventions extracted from the same sequence. This encourages the model to learn structural invariances and capture diverse anatomical cues, resulting in more expressive and generalizable features. To support this, we adapt the ST-GCN architecture to handle skeletons with varying joint layouts and scales through a unified representation scheme. Experiments on the NTU RGB+D 60 and 120 datasets demonstrate that MS-CLR consistently improves performance over strong single-skeleton contrastive learning baselines. A multi-skeleton ensemble further boosts performance, setting new state-of-the-art results on both datasets.
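The core idea, aligning embeddings of the same sequence encoded under two different skeleton conventions, is typically realized with an InfoNCE-style objective in this line of work. Below is a minimal NumPy sketch of such a symmetric cross-convention loss. The function names, the temperature value, and the symmetric formulation are illustrative assumptions, not the paper's exact loss or architecture:

```python
import numpy as np

def info_nce(za, zb, tau=0.1):
    """Contrastive loss between two (N, D) batches of L2-normalized
    embeddings of the SAME N sequences under two skeleton conventions.
    Matching rows (the diagonal) are positives; all other rows are negatives."""
    logits = za @ zb.T / tau                              # (N, N) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # cross-entropy with diagonal targets

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy stand-ins for two encoder outputs: "convention B" is a lightly
# perturbed copy of the same sequence embeddings (hypothetical data).
rng = np.random.default_rng(0)
z_conv_a = normalize(rng.normal(size=(8, 16)))
z_conv_b = normalize(z_conv_a + 0.05 * rng.normal(size=(8, 16)))

# Symmetrize so neither skeleton convention is privileged.
loss = 0.5 * (info_nce(z_conv_a, z_conv_b) + info_nce(z_conv_b, z_conv_a))
```

Under this sketch, correctly paired views yield a lower loss than mismatched ones, which is what drives the encoder toward representations invariant to the choice of skeleton convention.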