3D Skeleton-Based Action Recognition: A Review

📅 2025-06-01
🤖 AI Summary
Problem: Existing surveys on 3D skeleton-based action recognition focus predominantly on model architecture design and neglect foundational components of the task pipeline, particularly preprocessing, modeling, and evaluation, so these critical stages lack systematic analysis. Method: We propose the first task-oriented survey paradigm, establishing a comprehensive, end-to-end framework that covers modality derivation, data augmentation, feature extraction, and spatiotemporal modeling. The framework integrates state-of-the-art techniques, including graph convolutional networks, spatiotemporal Transformers, Mamba architectures, LLM prompt tuning, and diffusion-based generation. Contribution/Results: We systematically curate and unify 12 mainstream benchmark datasets, standardize performance evaluation for over 40 algorithms, and fill the long-standing gap in systematic analysis of foundational pipeline components. The survey delivers a reproducible, extensible roadmap and practical guidelines to advance the field.

📝 Abstract
With the inherent advantages of skeleton representation, 3D skeleton-based action recognition has become a prominent topic in computer vision. However, previous reviews have predominantly adopted a model-oriented perspective and often neglect the fundamental steps involved in skeleton-based action recognition. This oversight leaves key components beyond model design unexamined and has hindered a deeper understanding of the task. To bridge this gap, our review presents a comprehensive, task-oriented framework for understanding skeleton-based action recognition. We begin by decomposing the task into a series of sub-tasks, placing particular emphasis on preprocessing steps such as modality derivation and data augmentation. The subsequent discussion covers critical sub-tasks, including feature extraction and spatio-temporal modeling techniques. Beyond foundational action recognition networks, recently advanced frameworks such as hybrid architectures, Mamba models, large language models (LLMs), and generative models are also highlighted. Finally, a comprehensive overview of public 3D skeleton datasets is presented, accompanied by an analysis of state-of-the-art algorithms evaluated on these benchmarks. By integrating task-oriented discussions, comprehensive examinations of sub-tasks, and an emphasis on the latest advancements, our review provides a structured, accessible roadmap for understanding and advancing the field of 3D skeleton-based action recognition.
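
To make the modality derivation step mentioned in the abstract concrete, the sketch below derives two commonly used modalities from raw joint coordinates: bones (the vector from a joint's parent to the joint) and motion (the displacement of a joint between consecutive frames). It is a minimal illustration and not the review's own code. The five-joint `PARENTS` layout, the function names, and the zero padding of the last frame are assumptions made for this example, and plain Python lists are used only to keep it self-contained.

```python
# Toy skeleton with 5 joints. PARENTS[i] is the parent of joint i, and the
# root (joint 0) is its own parent. This layout is hypothetical and does not
# follow the joint order of any particular dataset.
PARENTS = [0, 0, 1, 1, 3]


def derive_bones(frame, parents):
    """Bone modality: vector from each joint's parent to the joint itself."""
    return [
        tuple(c - p for c, p in zip(frame[j], frame[parents[j]]))
        for j in range(len(frame))
    ]


def derive_motion(sequence):
    """Motion modality: per-joint displacement between consecutive frames.

    The last frame has no successor, so it is padded with zero vectors to
    keep the sequence length unchanged (an assumption of this sketch).
    """
    motion = []
    for t in range(len(sequence) - 1):
        motion.append([
            tuple(b - a for a, b in zip(cur, nxt))
            for cur, nxt in zip(sequence[t], sequence[t + 1])
        ])
    motion.append([(0.0, 0.0, 0.0)] * len(sequence[-1]))
    return motion


# Two frames of (x, y, z) joint coordinates.
sequence = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (-1.0, 1.0, 0.0), (-2.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 2.0, 0.0), (-1.0, 1.0, 0.0), (-2.0, 0.0, 0.0)],
]

bones = [derive_bones(frame, PARENTS) for frame in sequence]
motion = derive_motion(sequence)
```

Each derived modality keeps the same frames-by-joints-by-3 shape as the input, so it can be fed to the same kind of network as the joint stream.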
Problem

Research questions and friction points this paper is trying to address.

Addresses gaps in model-oriented reviews of skeleton action recognition
Proposes task-oriented framework for comprehensive understanding of the field
Highlights preprocessing, feature extraction, and latest modeling techniques
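
As one illustration of the preprocessing highlighted above, the sketch below applies a simple data augmentation: a single random rotation about the vertical axis, shared by all frames of a sequence, which roughly simulates a change of camera viewpoint. This is one common augmentation chosen for illustration, not a method attributed to the review. The function names, the 30-degree default, and the choice of y as the vertical axis are assumptions, and datasets differ in their axis conventions.

```python
import math
import random


def rotate_about_y(sequence, angle):
    """Rotate every joint of every frame by `angle` radians about the y axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        [(c * x + s * z, y, -s * x + c * z) for (x, y, z) in frame]
        for frame in sequence
    ]


def random_rotation(sequence, max_degrees=30.0, rng=random):
    """Apply one random rotation, shared by all frames, to a whole sequence.

    Assumes y is the vertical axis; adjust for the dataset at hand.
    """
    angle = math.radians(rng.uniform(-max_degrees, max_degrees))
    return rotate_about_y(sequence, angle)


# Two frames with two joints each, as (x, y, z) coordinates.
sequence = [
    [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)],
    [(1.0, 0.5, 0.0), (0.0, 2.0, 1.0)],
]
augmented = random_rotation(sequence, rng=random.Random(0))
```

Because the same angle is used for every frame, the motion within the sequence is preserved and only the viewing direction changes.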
Innovation

Methods, ideas, or system contributions that make the work stand out.

Task-oriented framework for skeleton action recognition
Emphasizes preprocessing like modality derivation
Highlights advanced models including LLMs
Mengyuan Liu
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School, Shenzhen, China
Hong Liu
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School, Shenzhen, China
Qianshuo Hu
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School, Shenzhen, China
Bin Ren
Department of Information Engineering and Computer Science (DISI), University of Trento, Trento, Italy
Junsong Yuan
State University of New York at Buffalo
computer vision, video analytics, action and gesture analysis, multimedia, pattern recognition
Jiaying Lin
Peking University
Computer Vision, Multimodal
Jiajun Wen
Sun Yat-sen University
Human Action Recognition, Embodied Intelligence