🤖 AI Summary
Existing skeleton-based action recognition methods struggle to effectively model long-range joint dependencies, complex temporal dynamics, and cross-modal semantic alignment. To address these limitations, this work proposes the Hierarchical Global-Local Skeleton-Language Model (HocSLM), which for the first time deeply integrates large vision-language models with skeletal representations. HocSLM employs an HGLNet architecture to capture multi-scale spatiotemporal dependencies and introduces a skeleton-language sequence fusion module that enables precise alignment within a unified semantic space. The proposed method achieves state-of-the-art performance across three benchmark datasets—NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA—demonstrating significantly enhanced semantic discriminability and cross-modal understanding capabilities.
📝 Abstract
Skeleton-based human action recognition has achieved remarkable progress in recent years. However, most existing GCN-based methods rely on short-range motion topologies, which not only struggle to capture long-range joint dependencies and complex temporal dynamics but also limit cross-modal semantic alignment and understanding due to insufficient modeling of action semantics. To address these challenges, we propose a hierarchical global-local skeleton-language model (HocSLM), enabling the large action model to better represent action semantics. First, we design a hierarchical global-local network (HGLNet) consisting of a composite-topology spatial module and a dual-path hierarchical temporal module. By synergistically integrating multi-level global and local modules, HGLNet achieves dynamically collaborative modeling at both global and local scales while preserving prior knowledge of human physical structure, significantly enhancing the model's representation of complex spatio-temporal relationships. Then, a large vision-language model (VLM) is employed to generate textual descriptions from the original RGB video sequences, providing rich action semantics for training the skeleton-language model. Furthermore, we introduce a skeleton-language sequential fusion module that combines the features from HGLNet with the generated descriptions, using a skeleton-language model (SLM) to precisely align skeletal spatio-temporal features and textual action descriptions within a unified semantic space. The SLM significantly enhances HGLNet's semantic discrimination and cross-modal understanding capabilities. Extensive experiments demonstrate that the proposed HocSLM achieves state-of-the-art performance on three mainstream benchmark datasets: NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA.
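The core cross-modal idea in the abstract — aligning skeleton features and text features in a unified semantic space — can be sketched as a contrastive objective over paired embeddings. The sketch below is illustrative only: the function names, the cosine-similarity measure, and the symmetric InfoNCE-style loss form are assumptions for exposition, not the paper's actual SLM implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two plain-list feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def alignment_loss(skeleton_feats, text_feats, temperature=0.1):
    """InfoNCE-style alignment loss (illustrative, not the paper's SLM).

    Each skeleton embedding is pulled toward its paired text embedding
    and pushed away from the other texts in the batch; a lower loss
    means the two modalities agree in the shared embedding space.
    """
    n = len(skeleton_feats)
    loss = 0.0
    for i in range(n):
        sims = [math.exp(cosine_similarity(skeleton_feats[i], t) / temperature)
                for t in text_feats]
        loss += -math.log(sims[i] / sum(sims))  # matched pair sits at index i
    return loss / n
```

Under this sketch, a batch whose skeleton and text embeddings are correctly paired yields a lower loss than the same batch with mismatched pairings, which is the behavior the abstract's "unified semantic space" alignment is meant to produce.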