🤖 AI Summary
Existing research on self-supervised skeleton-based action learning lacks a systematic survey and is largely confined to single-paradigm, single-task (i.e., action recognition only) settings, resulting in limited representation generalizability. Method: This paper presents the first systematic survey and unified benchmark for self-supervised skeleton action representation learning, exposing fundamental limitations of single-level modeling and task isolation, and proposes a multi-granularity joint pre-training framework built on a spatiotemporal cooperative multi-objective self-supervised strategy that unifies contextual modeling, generative learning, and contrastive learning. Contribution/Results: Evaluated on the NTU, PKU, and NW-UCLA benchmarks, the approach surpasses state-of-the-art methods and generalizes across diverse downstream tasks, including action recognition, retrieval, detection, and few-shot learning, demonstrating strong cross-task robustness and transferability.
📝 Abstract
Self-supervised learning (SSL), which aims to learn meaningful prior representations from unlabeled data, has proven effective for skeleton-based action understanding. Unlike the image domain, skeleton data possesses sparser spatial structures and diverse representation forms, lacks background cues, and carries an additional temporal dimension, presenting new challenges for designing spatial-temporal motion pretext tasks. Recently, many efforts have been made in skeleton-based SSL, achieving remarkable progress. However, a systematic and thorough review is still lacking. In this paper, we conduct, for the first time, a comprehensive survey of self-supervised skeleton-based action representation learning. Following the taxonomy of context-based, generative, and contrastive learning approaches, we thoroughly review and benchmark existing works and shed light on possible future directions. Remarkably, our investigation shows that most SSL works rely on a single paradigm, learn representations at a single level, and are evaluated solely on the action recognition task, leaving the generalization power of skeleton SSL models under-explored. To this end, we further propose a novel and effective SSL method for skeletons that integrates versatile representation learning objectives of different granularity, substantially boosting generalization across multiple skeleton downstream tasks. Extensive experiments on three large-scale datasets demonstrate that our method achieves superior generalization performance on various downstream tasks, including recognition, retrieval, detection, and few-shot learning.
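The abstract describes combining objectives from different SSL paradigms (contrastive and generative) into one pre-training loss. As a minimal illustrative sketch only, not the authors' actual method, the snippet below combines a contrastive InfoNCE term between two augmented views with a masked-joint reconstruction term; all function names, weights, and tensor shapes here are assumptions for illustration.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Contrastive term: InfoNCE between embeddings of two augmented views.

    z_a, z_b: (N, D) arrays; row i of z_a and row i of z_b form a positive pair.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature            # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_prob)))     # positives lie on the diagonal

def masked_reconstruction(seq, recon, mask):
    """Generative term: MSE computed only on masked skeleton joints.

    seq, recon, mask: (N, T, J) arrays (batch, frames, joint coordinates);
    mask is 1 where a joint was hidden from the encoder.
    """
    return float(((seq - recon) ** 2 * mask).sum() / mask.sum())

def multi_objective_loss(z_a, z_b, seq, recon, mask, w_con=1.0, w_gen=1.0):
    """Hypothetical weighted sum of contrastive and generative objectives."""
    return w_con * info_nce(z_a, z_b) + w_gen * masked_reconstruction(seq, recon, mask)
```

In practice such terms would be computed from a shared encoder's outputs and balanced by tuned weights; this sketch only shows how heterogeneous objectives can be summed into one training signal.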