🤖 AI Summary
Problem: Existing surveys on 3D skeleton-based action recognition focus predominantly on model architecture design and neglect foundational components across the task pipeline (particularly preprocessing, modeling, and evaluation), so these critical stages lack systematic analysis. Method: We propose the first task-oriented survey paradigm, establishing a comprehensive, end-to-end framework that covers modality derivation, data augmentation, feature extraction, and spatiotemporal modeling. The framework also integrates state-of-the-art techniques, including graph convolutional networks, spatiotemporal Transformers, Mamba architectures, LLM prompt tuning, and diffusion-based generation. Contribution/Results: We systematically curate and unify 12 mainstream benchmark datasets, standardize performance evaluation for over 40 algorithms, and fill the long-standing gap in systematic analysis of foundational pipeline components. The survey delivers a reproducible, extensible roadmap and practical guidelines to advance the field.
📝 Abstract
Owing to the inherent advantages of skeleton representation, 3D skeleton-based action recognition has become a prominent topic in computer vision. However, previous reviews have predominantly adopted a model-oriented perspective and often neglect the fundamental steps of the task. This overlooks key components beyond model design and has hindered a deeper, more intrinsic understanding of skeleton-based action recognition. To bridge this gap, we present a comprehensive, task-oriented framework for understanding the task. We begin by decomposing it into a series of sub-tasks, placing particular emphasis on preprocessing steps such as modality derivation and data augmentation. The subsequent discussion covers critical sub-tasks, including feature extraction and spatio-temporal modeling techniques. Beyond foundational action recognition networks, we also highlight recently advanced frameworks such as hybrid architectures, Mamba models, large language models (LLMs), and generative models. Finally, we present a comprehensive overview of public 3D skeleton datasets, accompanied by an analysis of state-of-the-art algorithms evaluated on these benchmarks. By integrating task-oriented discussion, a thorough examination of sub-tasks, and an emphasis on the latest advancements, our review provides a structured, accessible roadmap for understanding and advancing the field of 3D skeleton-based action recognition.
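To make the preprocessing step of modality derivation more concrete, the following is a minimal sketch of one common convention, not necessarily the formulation used in the review: a bone modality is taken as each joint's position minus its parent's position, and a motion modality as the difference between consecutive frames. The 5-joint layout, the `PARENTS` table, and the function names are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Frame = List[Vec3]       # one 3D position per joint
Sequence = List[Frame]   # one frame per time step

# Hypothetical 5-joint skeleton (assumption, not from the review):
# PARENTS[j] is the parent of joint j. The root (joint 0) is its own
# parent, so its bone vector is always zero.
PARENTS = [0, 0, 1, 1, 1]


def sub(a: Vec3, b: Vec3) -> Vec3:
    """Component-wise difference a - b."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def bone_modality(joints: Sequence, parents: List[int]) -> Sequence:
    """Bone vector of each joint: its position minus its parent's position."""
    return [[sub(frame[j], frame[parents[j]]) for j in range(len(frame))]
            for frame in joints]


def motion_modality(seq: Sequence) -> Sequence:
    """Frame-to-frame difference; the last frame is padded with zeros.

    Assumes a non-empty sequence.
    """
    zeros = [(0.0, 0.0, 0.0)] * len(seq[0])
    diffs = [[sub(nxt[j], cur[j]) for j in range(len(cur))]
             for cur, nxt in zip(seq, seq[1:])]
    return diffs + [zeros]


# Two frames in which only joint 3 moves (up by one unit along y).
joints = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0), (-1.0, 1.0, 0.0), (1.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0), (-1.0, 2.0, 0.0), (1.0, 1.0, 0.0)],
]
bones = bone_modality(joints, PARENTS)
joint_motion = motion_modality(joints)
```

Applying `motion_modality` to `bones` instead of `joints` would give a bone-motion stream in the same way. Each derived sequence keeps the shape of the input, so the streams can be fed to the same kind of network.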