Spatial-Temporal Decoupling Contrastive Learning for Skeleton-based Human Action Recognition

📅 2023-12-23
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing skeleton-based action recognition methods suffer from entangled spatiotemporal features and ambiguous semantic representations, lacking explicit modeling of intra-class variability and inter-class relationships, which leads to insufficient discriminability. To address this, the paper proposes a spatiotemporal decoupled contrastive learning framework: (1) it explicitly separates global skeleton-sequence features into spatial-specific and temporal-specific branches for the first time; (2) it introduces an attention-weighted contrastive loss to precisely capture semantic relationships across samples; and (3) it is plug-and-play with zero inference overhead. The method is compatible with mainstream encoders, including HCN, 2S-AGCN, CTR-GCN, and Hyperformer, and achieves an average accuracy improvement of 3.2% across the four backbone architectures on the NTU-60, NTU-120, and NW-UCLA benchmarks, demonstrating strong generality and effectiveness.
📝 Abstract
Skeleton-based action recognition is a central task in human-computer interaction. However, most previous methods suffer from two issues: (i) semantic ambiguity arising from the mixture of spatial and temporal information; and (ii) overlooking the explicit exploitation of the latent data distributions (i.e., the intra-class variations and inter-class relations), thereby leading to sub-optimal solutions for the skeleton encoders. To mitigate this, we propose a spatial-temporal decoupling contrastive learning (STD-CL) framework to obtain discriminative and semantically distinct representations from the sequences; it can be incorporated into various previous skeleton encoders and removed at test time. Specifically, we decouple the global features into spatial-specific and temporal-specific features to reduce the spatial-temporal coupling of features. Furthermore, to explicitly exploit the latent data distributions, we apply contrastive learning to the attentive features, which models cross-sequence semantic relations by pulling together the features from positive pairs and pushing apart those from negative pairs. Extensive experiments show that STD-CL with four different skeleton encoders (HCN, 2S-AGCN, CTR-GCN, and Hyperformer) achieves solid improvements on the NTU60, NTU120, and NW-UCLA benchmarks. The code will be released soon.
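The decoupling step described in the abstract can be illustrated with a minimal sketch. Assuming the encoder produces a feature map of shape (batch, channels, frames, joints), pooling over the temporal axis yields a spatial-specific feature while pooling over the joint axis yields a temporal-specific one. This pooling-based split is an illustrative assumption; the paper's exact decoupling operator may differ.

```python
import numpy as np

def decouple_features(feats):
    """Split global skeleton features into spatial- and temporal-specific parts.

    feats: array of shape (N, C, T, V) -- batch, channels, frames, joints.
    Averaging over frames (axis 2) removes temporal variation and keeps the
    per-joint spatial structure; averaging over joints (axis 3) does the
    opposite. Hypothetical sketch, not the authors' implementation.
    """
    spatial = feats.mean(axis=2)   # (N, C, V): spatial-specific branch
    temporal = feats.mean(axis=3)  # (N, C, T): temporal-specific branch
    return spatial, temporal

x = np.random.randn(8, 64, 25, 17)  # e.g. 25 frames, 17 joints
s, t = decouple_features(x)
print(s.shape, t.shape)  # (8, 64, 17) (8, 64, 25)
```

Because the two branches are derived by pooling, they add no parameters and can be dropped entirely at inference, consistent with the zero-overhead claim.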
Problem

Research questions and friction points this paper is trying to address.

Enhance 3D skeleton action recognition accuracy
Address intra-class and inter-class data distribution issues
Improve discriminative spatiotemporal feature representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive learning enhances spatiotemporal skeleton representations.
Decomposes features into spatial and temporal dimensions.
Attentive features model cross-sequence semantic relations.
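The "pull together positives, push away negatives" objective in the abstract is the standard contrastive-learning formulation. A minimal InfoNCE-style sketch on normalized feature vectors is shown below; the temperature value and the treatment of negatives are assumptions, and the paper's attention-weighted variant would additionally reweight the features before this step.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE-style contrastive loss on L2-normalized vectors.

    anchor, positive: shape (D,); negatives: shape (K, D).
    Lower loss means the anchor is closer to its positive than to the
    negatives. Illustrative sketch, not the paper's exact loss.
    """
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    a, p, n = norm(anchor), norm(positive), norm(negatives)
    logits = np.concatenate([[a @ p], n @ a]) / tau  # positive at index 0
    logits -= logits.max()                           # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])
```

As a sanity check, using the anchor itself as the positive gives a strictly lower loss than any non-parallel positive under the same negatives, which is the behavior the contrastive objective relies on.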
Shaojie Zhang
Beijing University of Posts and Telecommunications
Jianqin Yin
Beijing University of Posts and Telecommunications, Queen Mary School Hainan
Yonghao Dang
Beijing University of Posts and Telecommunications