Foundation Model for Skeleton-Based Human Action Understanding

📅 2025-08-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing skeleton-based foundation models suffer from limited generalization and poor task adaptability, which hinders fine-grained action recognition, dense action detection, and cross-domain transfer, all key requirements for comprehensive action understanding. To address this, we propose the Unified Skeleton-based Dense Representation Learning (USDRL) framework, the first scalable, multi-task-compatible skeleton foundation model. USDRL features a spatiotemporal dual-stream Transformer encoder, a multi-granularity feature decorrelation module to enhance discriminability, and multi-view and multimodal self-supervised consistency training to achieve cross-domain feature disentanglement. Evaluated on 25 benchmarks spanning nine task categories, USDRL consistently outperforms state-of-the-art methods. It delivers significant improvements on three core tasks: skeleton-based action recognition, dense action detection, and cross-domain transfer. By unifying representation learning objectives and architectural design principles, USDRL establishes a generalizable paradigm for skeleton representation learning.

📝 Abstract
Human action understanding serves as a foundational pillar in the field of intelligent motion perception. Skeletons serve as a modality- and device-agnostic representation for human modeling, and skeleton-based action understanding has potential applications in humanoid robot control and interaction. However, existing works often lack the scalability and generalization required to handle diverse action understanding tasks. There is no skeleton foundation model that can be adapted to a wide range of action understanding tasks. This paper presents a Unified Skeleton-based Dense Representation Learning (USDRL) framework, which serves as a foundational model for skeleton-based human action understanding. USDRL consists of a Transformer-based Dense Spatio-Temporal Encoder (DSTE), Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT). The DSTE module adopts two parallel streams to learn temporal dynamic features and spatial structure features. The MG-FD module jointly performs feature decorrelation across the temporal, spatial, and instance domains to reduce dimensional redundancy and enhance information extraction. The MPCT module employs both multi-view and multi-modal self-supervised consistency training. The former enhances the learning of high-level semantics and mitigates the impact of low-level discrepancies, while the latter facilitates the learning of informative multimodal features. We perform extensive experiments on 25 benchmarks across 9 skeleton-based action understanding tasks, covering coarse prediction, dense prediction, and transferred prediction. Our approach significantly outperforms the current state-of-the-art methods. We hope that this work will broaden the scope of research in skeleton-based action understanding and encourage more attention to dense prediction tasks.
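The abstract does not give the MG-FD loss itself, so the following is only a minimal sketch of what feature decorrelation across temporal, spatial, and instance domains could look like. It assumes a generic penalty on off-diagonal correlations between feature channels (in the spirit of redundancy-reduction objectives), applied to a hypothetical dense feature tensor of shape (batch, time, joints, channels). The function names, the tensor layout, and the pooling choices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def decorrelation_loss(features: np.ndarray, eps: float = 1e-8) -> float:
    """Mean squared off-diagonal correlation between feature channels.

    features: array of shape (n_samples, n_channels), n_channels > 1.
    A value near 0 means channels carry little redundant information.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    normalized = centered / (centered.std(axis=0, keepdims=True) + eps)
    n_samples, n_channels = normalized.shape
    corr = normalized.T @ normalized / n_samples  # (n_channels, n_channels)
    off_diag = corr - np.diag(np.diag(corr))      # zero out the diagonal
    return float((off_diag ** 2).sum() / (n_channels * (n_channels - 1)))


def multi_grained_decorrelation(x: np.ndarray) -> dict:
    """Apply the same penalty at three granularities (assumed pooling scheme).

    x: hypothetical dense features of shape (batch, time, joints, channels).
    """
    b, t, j, c = x.shape
    return {
        # One sample per (sequence, frame): joints are averaged out.
        "temporal": decorrelation_loss(x.mean(axis=2).reshape(b * t, c)),
        # One sample per (sequence, joint): frames are averaged out.
        "spatial": decorrelation_loss(x.mean(axis=1).reshape(b * j, c)),
        # One sample per sequence: frames and joints are averaged out.
        "instance": decorrelation_loss(x.mean(axis=(1, 2))),
    }
```

In a training loop, the three values would typically be weighted and added to the main objective; the weights are a further design choice the abstract does not specify.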
Problem

Research questions and friction points this paper is trying to address.

Lack of scalable skeleton foundation model for diverse tasks
Need for unified framework in skeleton-based action understanding
Improving generalization across multiple action understanding tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based Dense Spatio-Temporal Encoder
Multi-Grained Feature Decorrelation technique
Multi-Perspective Consistency Training method
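As an illustration of the multi-view part of consistency training, the sketch below builds a second view of a skeleton sequence by rotating it about the vertical axis and scores agreement between two embeddings with a cosine-based loss. Both the rotation augmentation and the cosine loss are assumptions chosen for simplicity; the page does not state which augmentations or consistency objective USDRL uses, and the function names are hypothetical.

```python
import numpy as np


def rotate_about_y(skeleton: np.ndarray, angle_rad: float) -> np.ndarray:
    """Rotate a skeleton sequence of shape (frames, joints, 3) about the y axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rotation = np.array([[c, 0.0, s],
                         [0.0, 1.0, 0.0],
                         [-s, 0.0, c]])
    # Each joint is a row vector, so multiply by the transposed matrix.
    return skeleton @ rotation.T


def consistency_loss(z_a: np.ndarray, z_b: np.ndarray, eps: float = 1e-8) -> float:
    """Mean (1 - cosine similarity) between paired embeddings of shape (n, d)."""
    z_a = z_a / (np.linalg.norm(z_a, axis=-1, keepdims=True) + eps)
    z_b = z_b / (np.linalg.norm(z_b, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(z_a * z_b, axis=-1)))
```

The intended use is to encode the original sequence and its rotated copy with the same encoder and minimize `consistency_loss` between the two embeddings, which pushes the encoder toward view-invariant, higher-level features.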
Hongsong Wang
School of Computer Science and Engineering, Southeast University, Nanjing 211189, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China
Wanjiang Weng
School of Computer Science and Engineering, Southeast University, Nanjing 211189, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China
Junbo Wang
School of Software, Northwestern Polytechnical University, Xi’an 710072, China
Fang Zhao
State Key Laboratory for Novel Software Technology and School of Intelligence Science and Technology, Nanjing University, Nanjing 210023, China
Guo-Sen Xie
Professor, Nanjing University of Science and Technology
Computer Vision, Machine Learning
Xin Geng
School of Computer Science and Engineering, Southeast University
Artificial Intelligence, Pattern Recognition, Machine Learning
Liang Wang
New Laboratory of Pattern Recognition (NLPR), State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA); School of Artificial Intelligence, University of Chinese Academy of Sciences