AI Summary
Existing self-supervised contrastive learning methods for skeleton-based action recognition typically treat skeletal regions uniformly and rely on FIFO queues to store negative samples, which loses motion detail and yields suboptimal negative sample selection. To address these issues, this paper proposes a dominance-game-based self-supervised contrastive learning framework. First, it models dynamic dominance relationships between positive and negative samples to enhance representation discriminability and semantic consistency. Second, it introduces spatio-temporal dual-dimensional weighted region localization and region-level data augmentation to preserve critical motion structures. Third, it incorporates an entropy-driven hard-negative memory bank with dynamic updating to improve negative sample quality. Extensive experiments demonstrate state-of-the-art performance: on NTU RGB+D 120, the method improves over prior art by 1.1% and 2.3% on the X-Sub and X-Set benchmarks, respectively, and on PKU-MMD Part II it achieves a 1.9% gain.
Abstract
Existing self-supervised contrastive learning methods for skeleton-based action recognition often process all skeleton regions uniformly and adopt a first-in-first-out (FIFO) queue to store negative samples, which leads to loss of motion information and suboptimal negative sample selection. To address these challenges, this paper proposes the Dominance-Game Contrastive Learning network for skeleton-based action Recognition (DoGCLR), a self-supervised framework based on game theory. DoGCLR models the construction of positive and negative samples as a dynamic Dominance Game, in which both sample types interact to reach an equilibrium that balances semantic preservation and discriminative strength. Specifically, a spatio-temporal dual-weight localization mechanism identifies key motion regions and guides region-wise augmentations to enhance motion diversity while preserving semantics. In parallel, an entropy-driven dominance strategy manages the memory bank by retaining high-entropy (hard) negatives and replacing low-entropy (weak) ones, ensuring consistent exposure to informative contrastive signals. Extensive experiments are conducted on the NTU RGB+D and PKU-MMD datasets. On NTU RGB+D 60 X-Sub/X-View, DoGCLR achieves 81.1%/89.4% accuracy, and on NTU RGB+D 120 X-Sub/X-Set it achieves 71.2%/75.5% accuracy, surpassing state-of-the-art methods by 0.1%, 2.7%, 1.1%, and 2.3%, respectively. On PKU-MMD Part I/Part II, DoGCLR performs on par with state-of-the-art methods and achieves 1.9% higher accuracy on Part II, highlighting its robustness in more challenging scenarios.
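To make the two mechanisms concrete, the sketch below shows one plausible reading of the spatio-temporal weighted region localization: motion magnitude serves as a proxy for frame- and joint-level importance, and augmentation is applied only to low-weight joints so that key motion regions keep their semantics. The function names, the motion-magnitude weighting, and the noise-based augmentation are illustrative assumptions, not the authors' implementation.

```python
import torch

def region_weights(x: torch.Tensor):
    # x: (C, T, V) skeleton sequence (channels, frames, joints).
    motion = (x[:, 1:] - x[:, :-1]).abs()      # frame-to-frame displacement, (C, T-1, V)
    t_w = motion.sum(dim=(0, 2))               # temporal importance per frame, (T-1,)
    v_w = motion.sum(dim=(0, 1))               # spatial importance per joint, (V,)
    return t_w / t_w.sum(), v_w / v_w.sum()

def region_augment(x: torch.Tensor, drop_ratio: float = 0.3) -> torch.Tensor:
    # Perturb only the least informative joints so key motion regions stay intact.
    _, v_w = region_weights(x)
    k = max(1, int(drop_ratio * v_w.numel()))
    low = v_w.topk(k, largest=False).indices   # indices of low-weight joints
    out = x.clone()
    out[:, :, low] = out[:, :, low] + 0.05 * torch.randn_like(out[:, :, low])
    return out
```

Similarly, a minimal sketch of the entropy-driven memory bank, assuming entropy is measured over each stored negative's similarity distribution to the current anchors: low-entropy (weak) entries are overwritten by incoming negatives rather than dequeued FIFO-style. Again, the class, the entropy proxy, and the replacement rule are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

class EntropyMemoryBank:
    def __init__(self, size: int, dim: int, temperature: float = 0.07):
        self.bank = F.normalize(torch.randn(size, dim), dim=1)  # stored negative embeddings
        self.t = temperature

    def entropy_scores(self, anchors: torch.Tensor) -> torch.Tensor:
        # Entropy of each stored negative's similarity distribution over the anchors;
        # higher entropy ~ harder / more ambiguous negative.
        sim = self.bank @ F.normalize(anchors, dim=1).T / self.t   # (size, batch)
        p = sim.softmax(dim=1)
        return -(p * p.clamp_min(1e-12).log()).sum(dim=1)          # (size,)

    @torch.no_grad()
    def update(self, anchors: torch.Tensor, new_negatives: torch.Tensor):
        # Overwrite the lowest-entropy (weak) slots instead of FIFO dequeueing.
        new_negatives = F.normalize(new_negatives, dim=1)
        n = min(new_negatives.shape[0], self.bank.shape[0])
        weakest = self.entropy_scores(anchors).topk(n, largest=False).indices
        self.bank[weakest] = new_negatives[:n]

    def negatives(self) -> torch.Tensor:
        return self.bank
```

In a MoCo-style training loop, `update(anchor_embeddings, key_embeddings)` would be called after each step and `negatives()` would supply the negative terms of the contrastive (InfoNCE) loss.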