LLM Enhanced Action Recognition via Hierarchical Global-Local Skeleton-Language Model

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing skeleton-based action recognition methods struggle to effectively model long-range joint dependencies, complex temporal dynamics, and cross-modal semantic alignment. To address these limitations, this work proposes the Hierarchical Global-Local Skeleton-Language Model (HocSLM), which for the first time deeply integrates large vision-language models with skeletal representations. HocSLM employs an HGLNet architecture to capture multi-scale spatiotemporal dependencies and introduces a skeleton-language sequence fusion module that enables precise alignment within a unified semantic space. The proposed method achieves state-of-the-art performance across three benchmark datasets—NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA—demonstrating significantly enhanced semantic discriminability and cross-modal understanding capabilities.
📝 Abstract
Skeleton-based human action recognition has achieved remarkable progress in recent years. However, most existing GCN-based methods rely on short-range motion topologies, which not only struggle to capture long-range joint dependencies and complex temporal dynamics but also limit cross-modal semantic alignment and understanding due to insufficient modeling of action semantics. To address these challenges, we propose a hierarchical global-local skeleton-language model (HocSLM), enabling the large action model to better represent action semantics. First, we design a hierarchical global-local network (HGLNet) that consists of a composite-topology spatial module and a dual-path hierarchical temporal module. By synergistically integrating multi-level global and local modules, HGLNet achieves dynamically collaborative modeling at both global and local scales while preserving prior knowledge of human physical structure, significantly enhancing the model's representation of complex spatio-temporal relationships. Then, a large vision-language model (VLM) is employed to generate textual descriptions from the original RGB video sequences, providing rich action semantics for further training the skeleton-language model. Furthermore, we introduce a skeleton-language sequential fusion module that combines the features from HGLNet with the generated descriptions, using a skeleton-language model (SLM) to precisely align skeletal spatio-temporal features and textual action descriptions within a unified semantic space. The SLM significantly enhances HGLNet's semantic discrimination capabilities and cross-modal understanding abilities. Extensive experiments demonstrate that the proposed HocSLM achieves state-of-the-art performance on three mainstream benchmark datasets: NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA.
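The abstract describes aligning skeletal spatio-temporal features with textual action descriptions in a unified semantic space. The paper does not give the training objective here, but such cross-modal alignment is commonly implemented as a symmetric contrastive (InfoNCE-style) loss over paired skeleton and text embeddings. The sketch below is illustrative only, with an assumed embedding dimension and temperature; function names are hypothetical and not taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # project embeddings onto the unit sphere so dot products are cosine similarities
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def contrastive_alignment_loss(skel_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss for skeleton-text alignment (illustrative).

    skel_emb, text_emb: (batch, dim) arrays; row i of each modality is
    assumed to describe the same action sample, so matching pairs lie on
    the diagonal of the similarity matrix.
    """
    s = l2_normalize(skel_emb)
    t = l2_normalize(text_emb)
    logits = s @ t.T / temperature      # (batch, batch) cosine similarities
    idx = np.arange(len(s))             # positive pairs on the diagonal

    def xent(lg):
        # numerically stable cross-entropy against the diagonal targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average skeleton->text and text->skeleton directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls each skeleton embedding toward the description of the same action and pushes it away from descriptions of other actions in the batch, which is one standard way to realize the "unified semantic space" the abstract refers to.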
Problem

Research questions and friction points this paper is trying to address.

skeleton-based action recognition
long-range dependencies
temporal dynamics
action semantics
cross-modal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

skeleton-language model
hierarchical global-local network
vision-language model
spatio-temporal modeling
cross-modal alignment
Ruosi Wang
Shanghai International Studies University
Cognitive Neuroscience, Object Recognition
Fangwei Zuo
Northwestern Polytechnical University, No.1 Dongxiang Road, Chang’an District, Xi’an, 710129, Shaanxi, China
Lei Li
Shandong University of Finance and Economics, No.40 Shungeng Road, Licheng District, Jinan, Shandong, China
Zhaoqiang Xia
Northwestern Polytechnical University
Visual Computing, Information Processing