Towards Universal Skeleton-Based Action Recognition

📅 2026-04-18
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the limitations of existing approaches that overlook discrepancies in skeleton sources and structures and struggle with open-vocabulary action recognition. It presents the first systematic study on open-vocabulary action recognition under heterogeneous skeletal data. To this end, the authors introduce a large-scale Heterogeneous Open-Vocabulary (HOV) skeleton dataset and propose a unified Transformer-based framework. This framework leverages a standardized skeleton representation, a dual-stream motion encoder, and a multi-granularity action–text contrastive alignment mechanism operating at global, stream-specific, and fine-grained levels to enable cross-domain generalization for action understanding. Experiments demonstrate that the proposed method significantly outperforms current state-of-the-art approaches across multiple heterogeneous skeleton benchmarks, exhibiting strong generalization capabilities.
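
The summary does not say how the standardized skeleton representation is built. As a minimal sketch of one plausible approach, the Python below maps skeletons with different joint sets onto a shared layout, zero-filling missing joints and returning a validity mask. The joint names, the `CANONICAL_JOINTS` list, and the `to_canonical` helper are all hypothetical and are not taken from the paper or its code.

```python
# Hypothetical shared joint layout; the paper's actual layout is not given here.
CANONICAL_JOINTS = [
    "pelvis", "spine", "head",
    "l_shoulder", "r_shoulder",
    "l_hand", "r_hand",
    "l_foot", "r_foot",
]


def to_canonical(frame, joint_names):
    """Map one frame of (x, y, z) joints onto the shared layout.

    Returns (coords, mask). Joints absent from the source skeleton are
    zero-filled and flagged with mask 0 so a downstream model can ignore them.
    """
    by_name = dict(zip(joint_names, frame))
    coords, mask = [], []
    for name in CANONICAL_JOINTS:
        if name in by_name:
            coords.append(tuple(by_name[name]))
            mask.append(1)
        else:
            coords.append((0.0, 0.0, 0.0))
            mask.append(0)
    return coords, mask


# Two sources with different joint sets end up with the same shape.
human_frame = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.7)]
robot_frame = [(0.0, 0.0, 0.9), (0.3, 0.0, 1.2), (-0.3, 0.0, 1.2)]
human_coords, human_mask = to_canonical(human_frame, ["pelvis", "head"])
robot_coords, robot_mask = to_canonical(robot_frame, ["pelvis", "l_hand", "r_hand"])
```

Under this assumption, every sequence has the same number of joints regardless of source, and the mask lets the encoder tell real joints from padding.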

📝 Abstract
With the development of robotics, skeleton-based action recognition has become increasingly important, as human-robot interaction requires understanding the actions of humans and humanoid robots. Due to different sources of human skeletons and structures of humanoid robots, skeleton data naturally exhibit heterogeneity. However, previous works overlook the data heterogeneity of skeletons and solely construct models using homogeneous skeletons. Moreover, open-vocabulary action recognition is also essential for real-world applications. To this end, this work studies the challenging problem of heterogeneous skeleton-based action recognition with open vocabularies. We construct a large-scale Heterogeneous Open-Vocabulary (HOV) Skeleton dataset by integrating and refining multiple representative large-scale skeleton-based action datasets. To address universal skeleton-based action recognition, we propose a Transformer-based model that comprises three key components: unified skeleton representation, motion encoder for skeletons, and multi-grained motion-text alignment. The motion encoder feeds multi-modal skeleton embeddings into a two-stream Transformer-based encoder to learn spatio-temporal action representations, which are then mapped to a semantic space to align with text embeddings. Multi-grained motion-text alignment incorporates contrastive learning at three levels: global instance alignment, stream-specific alignment, and fine-grained alignment. Extensive experiments on popular benchmarks with heterogeneous skeleton data demonstrate both the effectiveness and the generalization ability of the proposed method. Code is available at https://github.com/jidongkuang/Universal-Skeleton.
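
The three-level contrastive alignment named in the abstract can be illustrated with a small pure-Python sketch. It assumes a symmetric InfoNCE loss at each level, one pooled embedding per sample at every level (including the fine-grained one), and equal loss weights. The abstract specifies none of these, so the function names, the temperature, and the weighting are hypothetical and not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(motion, text, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired (motion, text) embeddings."""
    m = [l2_normalize(v) for v in motion]
    t = [l2_normalize(v) for v in text]
    n = len(m)
    # logits[i][j]: cosine similarity of motion i and text j, divided by temperature.
    logits = [
        [sum(a * b for a, b in zip(m[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            mx = max(row)
            log_z = mx + math.log(sum(math.exp(x - mx) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    cols = [list(c) for c in zip(*logits)]  # text-to-motion direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(cols))


def multi_grained_loss(global_pair, stream_pairs, fine_pair, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of global, stream-specific, and fine-grained losses.

    Assumption: each level supplies one embedding per sample; equal weights
    are a placeholder, not a value from the paper.
    """
    w_g, w_s, w_f = weights
    loss_global = info_nce(*global_pair)
    loss_stream = sum(info_nce(m, t) for m, t in stream_pairs) / len(stream_pairs)
    loss_fine = info_nce(*fine_pair)
    return w_g * loss_global + w_s * loss_stream + w_f * loss_fine


# Toy batch: matched pairs should score a lower loss than swapped pairs.
motion = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = info_nce(motion, text_matched)
loss_swapped = info_nce(motion, text_swapped)
total = multi_grained_loss(
    (motion, text_matched),
    [(motion, text_matched), (motion, text_matched)],
    (motion, text_matched),
)
```

In this toy batch, `loss_matched` is close to zero and `loss_swapped` is large, which is the behaviour a contrastive objective relies on: paired motion and text embeddings are pulled together while mismatched pairs are pushed apart.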
Problem

Research questions and friction points this paper is trying to address.

heterogeneous skeletons
open-vocabulary action recognition
skeleton-based action recognition
human-robot interaction
data heterogeneity
Innovation

Methods, ideas, or system contributions that make the work stand out.

heterogeneous skeletons
open-vocabulary action recognition
Transformer-based model
multi-grained alignment
skeleton-text contrastive learning