🤖 AI Summary
Existing skeleton-based methods suffer from limited generalization and poor task adaptability, which hinders fine-grained action recognition, dense action detection, and cross-domain transfer, all key requirements for comprehensive action understanding. To address this, we propose the Unified Skeleton-based Dense Representation Learning (USDRL) framework, the first scalable, multi-task-compatible skeleton foundation model. USDRL features a dual-stream spatio-temporal Transformer encoder, a multi-grained feature decorrelation module to enhance discriminability, and multi-view and multimodal self-supervised consistency training to achieve cross-domain feature disentanglement. Evaluated on 25 benchmarks across nine task categories, USDRL consistently outperforms state-of-the-art methods. It delivers significant improvements in three core areas: skeleton-based action recognition, dense action detection, and cross-domain transfer. By unifying representation learning objectives and architectural design principles, USDRL establishes a generalizable paradigm for skeleton representation learning.
📝 Abstract
Human action understanding is a foundational pillar of intelligent motion perception. Skeletons provide a modality- and device-agnostic representation for human modeling, and skeleton-based action understanding has potential applications in humanoid robot control and interaction. However, existing works often lack the scalability and generalization required to handle diverse action understanding tasks. There is no skeleton foundation model that can be adapted to a wide range of action understanding tasks. This paper presents a Unified Skeleton-based Dense Representation Learning (USDRL) framework, which serves as a foundation model for skeleton-based human action understanding. USDRL consists of a Transformer-based Dense Spatio-Temporal Encoder (DSTE), Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT). The DSTE module adopts two parallel streams to learn temporal dynamics and spatial structure features. The MG-FD module jointly performs feature decorrelation across the temporal, spatial, and instance domains to reduce dimensional redundancy and enhance information extraction. The MPCT module employs both multi-view and multimodal self-supervised consistency training. The former enhances the learning of high-level semantics and mitigates the impact of low-level discrepancies, while the latter facilitates the learning of informative multimodal features. We perform extensive experiments on 25 benchmarks across 9 skeleton-based action understanding tasks, covering coarse prediction, dense prediction, and transferred prediction. Our approach significantly outperforms the current state-of-the-art methods. We hope that this work will broaden the scope of research in skeleton-based action understanding and encourage more attention to dense prediction tasks.
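The abstract does not give the MG-FD loss, so the following is only a minimal sketch of what feature decorrelation across the temporal, spatial, and instance domains might look like. It assumes a generic off-diagonal correlation penalty (in the spirit of redundancy-reduction objectives such as Barlow Twins or VICReg), not the authors' actual formulation. The function names `decorrelation_loss` and `multi_grained_loss`, the `(N, T, J, D)` feature layout, and the mean-pooling used to form each granularity are all hypothetical.

```python
import numpy as np


def decorrelation_loss(features, eps=1e-8):
    """Mean squared off-diagonal entry of the feature correlation matrix.

    features: array of shape (num_samples, dim). A value near 0 means the
    feature dimensions are nearly uncorrelated; a value near 1 means they
    are highly redundant.
    """
    # Standardize each feature dimension over the sample axis.
    z = features - features.mean(axis=0, keepdims=True)
    z = z / (z.std(axis=0, keepdims=True) + eps)
    n, d = z.shape
    corr = z.T @ z / n                       # (dim, dim) correlation matrix
    off_diag = corr - np.diag(np.diag(corr))  # zero out the diagonal
    return float((off_diag ** 2).sum() / (d * (d - 1)))


def multi_grained_loss(x):
    """Average the penalty over three assumed granularities.

    x: dense features of shape (N, T, J, D) = (instances, frames, joints, dim).
    The pooling choices below are illustrative assumptions.
    """
    n, t, j, d = x.shape
    instance = x.mean(axis=(1, 2))               # (N, D): one vector per sequence
    temporal = x.mean(axis=2).reshape(n * t, d)  # (N*T, D): one vector per frame
    spatial = x.mean(axis=1).reshape(n * j, d)   # (N*J, D): one vector per joint
    losses = [decorrelation_loss(v) for v in (instance, temporal, spatial)]
    return sum(losses) / 3.0
```

In a training loop, a term like this would be added to the consistency objective so that the encoder is pushed toward features whose dimensions carry non-redundant information at every granularity.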