🤖 AI Summary
Existing skeleton-based action recognition methods struggle to effectively model long-range joint dependencies, complex temporal dynamics, and cross-modal semantic alignment. To address these limitations, this work proposes the Hierarchical Global-Local Skeleton-Language Model (HocSLM), which for the first time deeply integrates large vision-language models with skeletal representations. HocSLM employs an HGLNet architecture to capture multi-scale spatiotemporal dependencies and introduces a skeleton-language sequence fusion module that enables precise alignment within a unified semantic space. The proposed method achieves state-of-the-art performance across three benchmark datasets—NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA—demonstrating significantly enhanced semantic discriminability and cross-modal understanding capabilities.
📝 Abstract
Skeleton-based human action recognition has achieved remarkable progress in recent years. However, most existing GCN-based methods rely on short-range motion topologies, which not only struggle to capture long-range joint dependencies and complex temporal dynamics but also limit cross-modal semantic alignment and understanding due to insufficient modeling of action semantics. To address these challenges, we propose a hierarchical global-local skeleton-language model (HocSLM), enabling the large action model to better represent action semantics. First, we design a hierarchical global-local network (HGLNet) consisting of a composite-topology spatial module and a dual-path hierarchical temporal module. By synergistically integrating multi-level global and local modules, HGLNet achieves dynamically collaborative modeling at both global and local scales while preserving prior knowledge of human physical structure, significantly enhancing the model's representation of complex spatio-temporal relationships. Then, a large vision-language model (VLM) is employed to generate textual descriptions from the original RGB video sequences, providing rich action semantics for training the skeleton-language model. Furthermore, we introduce a skeleton-language sequential fusion module that combines the features from HGLNet with the generated descriptions, using a skeleton-language model (SLM) to precisely align skeletal spatio-temporal features and textual action descriptions within a unified semantic space. The SLM significantly enhances HGLNet's semantic discrimination and cross-modal understanding capabilities. Extensive experiments demonstrate that the proposed HocSLM achieves state-of-the-art performance on three mainstream benchmark datasets: NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA.
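The core cross-modal idea in the abstract — aligning skeleton features and text features in a unified semantic space — can be sketched as a contrastive objective over paired embeddings. The sketch below is illustrative only: the function names, the cosine-similarity measure, and the symmetric InfoNCE-style loss form are assumptions for exposition, not the paper's actual SLM implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two plain-list feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def alignment_loss(skeleton_feats, text_feats, temperature=0.1):
    """InfoNCE-style alignment loss (illustrative, not the paper's SLM).

    Each skeleton embedding is pulled toward its paired text embedding
    and pushed away from the other texts in the batch; a lower loss
    means the two modalities agree in the shared embedding space.
    """
    n = len(skeleton_feats)
    loss = 0.0
    for i in range(n):
        sims = [math.exp(cosine_similarity(skeleton_feats[i], t) / temperature)
                for t in text_feats]
        loss += -math.log(sims[i] / sum(sims))  # matched pair sits at index i
    return loss / n
```

Under this sketch, a batch whose skeleton and text embeddings are correctly paired yields a lower loss than the same batch with mismatched pairings, which is the behavior the abstract's "unified semantic space" alignment is meant to produce.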