Foundation Model for Skeleton-Based Human Action Understanding

📅 2025-08-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing skeleton-based methods generalize poorly and adapt badly to new tasks, which hinders fine-grained action recognition, dense action detection, and cross-domain transfer, all key requirements for comprehensive action understanding. To address this, we propose the Unified Skeleton-based Dense Representation Learning (USDRL) framework, a scalable, multi-task-compatible skeleton foundation model. USDRL combines a dual-stream spatio-temporal Transformer encoder, a multi-grained feature decorrelation module that reduces redundancy across feature dimensions, and self-supervised consistency training across multiple views and multiple modalities. Evaluated on 25 benchmarks spanning nine tasks, USDRL outperforms state-of-the-art methods in all three task groups: coarse prediction (such as action recognition), dense prediction (such as action detection), and transferred prediction. By unifying the representation learning objective and the architecture, USDRL offers a general paradigm for skeleton representation learning.

📝 Abstract

Human action understanding serves as a foundational pillar in the field of intelligent motion perception. Skeletons serve as a modality- and device-agnostic representation for human modeling, and skeleton-based action understanding has potential applications in humanoid robot control and interaction. However, existing works often lack the scalability and generalization required to handle diverse action understanding tasks. There is no skeleton foundation model that can be adapted to a wide range of action understanding tasks. This paper presents a Unified Skeleton-based Dense Representation Learning (USDRL) framework, which serves as a foundation model for skeleton-based human action understanding. USDRL consists of a Transformer-based Dense Spatio-Temporal Encoder (DSTE), Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT). The DSTE module adopts two parallel streams to learn temporal-dynamics and spatial-structure features. The MG-FD module jointly performs feature decorrelation across the temporal, spatial, and instance domains to reduce dimensional redundancy and enhance information extraction. The MPCT module employs both multi-view and multi-modal self-supervised consistency training. The former enhances the learning of high-level semantics and mitigates the impact of low-level discrepancies, while the latter facilitates the learning of informative multimodal features. We perform extensive experiments on 25 benchmarks across 9 skeleton-based action understanding tasks, covering coarse prediction, dense prediction, and transferred prediction. Our approach significantly outperforms the current state-of-the-art methods. We hope that this work will broaden the scope of research in skeleton-based action understanding and encourage more attention to dense prediction tasks.
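To make the two-stream idea concrete, here is a minimal sketch of how one skeleton clip could be turned into a temporal token sequence and a spatial token sequence. The abstract does not specify the tokenization, so the function name `two_stream_tokens`, the `(T, J, C)` input layout, and the reshaping scheme are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def two_stream_tokens(skeleton):
    """Split a (T, J, C) skeleton clip into two token sequences.

    Assumed layout (not from the paper): T frames, J joints, C coordinates.
    """
    T, J, C = skeleton.shape
    # Temporal stream: one token per frame, holding all joints of that frame.
    temporal_tokens = skeleton.reshape(T, J * C)
    # Spatial stream: one token per joint, holding that joint's full trajectory.
    spatial_tokens = skeleton.transpose(1, 0, 2).reshape(J, T * C)
    return temporal_tokens, spatial_tokens

# Hypothetical clip: 64 frames, 25 joints, 3D coordinates.
clip = np.random.default_rng(0).normal(size=(64, 25, 3))
temporal, spatial = two_stream_tokens(clip)
print(temporal.shape, spatial.shape)  # (64, 75) (25, 192)
```

Each sequence would then be fed to its own Transformer stream, so that attention in one stream operates over frames and in the other over joints.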

Problem

Research questions and friction points this paper is trying to address.

Lack of scalable skeleton foundation model for diverse tasks

Need for unified framework in skeleton-based action understanding

Improving generalization across multiple action understanding tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based Dense Spatio-Temporal Encoder

Multi-Grained Feature Decorrelation technique

Multi-Perspective Consistency Training method
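
The feature decorrelation idea listed above can be illustrated with a short sketch. The abstract does not give the loss, so this uses a generic off-diagonal correlation penalty as a stand-in. The function names, the `(B, T, J, D)` feature layout, and the mean-pooling used to form the instance, temporal, and spatial views are all assumptions, not the paper's formulation.

```python
import numpy as np

def decorrelation_loss(features, eps=1e-6):
    """Mean squared off-diagonal correlation between feature dimensions.

    features: (N, D) array of N samples with D feature dimensions.
    """
    z = features - features.mean(axis=0, keepdims=True)
    z = z / (z.std(axis=0, keepdims=True) + eps)
    corr = z.T @ z / features.shape[0]        # (D, D) correlation matrix
    off_diag = corr - np.diag(np.diag(corr))  # zero out the diagonal
    return float((off_diag ** 2).sum() / features.shape[1])

def multi_grained_decorrelation(feats):
    """Sum the penalty over three assumed granularities of (B, T, J, D) features."""
    B, T, J, D = feats.shape
    instance = feats.mean(axis=(1, 2))               # (B, D): one vector per clip
    temporal = feats.mean(axis=2).reshape(B * T, D)  # one vector per frame
    spatial = feats.mean(axis=1).reshape(B * J, D)   # one vector per joint
    return sum(decorrelation_loss(x) for x in (instance, temporal, spatial))

rng = np.random.default_rng(0)
independent = rng.normal(size=(512, 8))
# Redundant features: every dimension is a near-copy of the same signal.
redundant = np.repeat(rng.normal(size=(512, 1)), 8, axis=1)
redundant = redundant + 0.01 * rng.normal(size=(512, 8))
print(decorrelation_loss(independent) < decorrelation_loss(redundant))  # True
```

Minimizing a penalty of this kind pushes feature dimensions to carry different information, which is the sense in which decorrelation reduces dimensional redundancy.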

🔎 Similar Papers

Self-Supervised Skeleton-Based Action Representation Learning: A Benchmark and Beyond