🤖 AI Summary
Existing skeleton-based zero-shot action recognition methods rely on static, class-level semantic alignment, which inadequately mitigates the domain shift between seen and unseen classes and hinders fine-grained knowledge transfer. To address this, we propose a dynamic multi-scale vision–semantics matching framework: (1) an inference-time dynamic text-feature refinement mechanism that improves semantic alignment precision; (2) hierarchical action descriptions generated by large language models, integrated with adaptive joint grouping for structure-aware skeletal modeling; and (3) a confidence-aware, class-balanced memory bank that suppresses pseudo-label noise. Our method achieves new state-of-the-art results on NTU RGB+D 60/120 and PKU-MMD, demonstrating significantly improved zero-shot generalization. The code is publicly available.
📝 Abstract
Zero-shot skeleton-based action recognition (ZS-SAR) is fundamentally constrained by prevailing approaches that align skeleton features with static, class-level semantics. This coarse-grained alignment fails to bridge the domain shift between seen and unseen classes, thereby impeding the effective transfer of fine-grained visual knowledge. To address these limitations, we introduce **DynaPURLS**, a unified framework that establishes robust, multi-scale visual-semantic correspondences and dynamically refines them at inference time to enhance generalization. Our framework leverages a large language model to generate hierarchical textual descriptions that encompass both global movements and local body-part dynamics. Concurrently, an adaptive partitioning module produces fine-grained visual representations by semantically grouping skeleton joints. To fortify this fine-grained alignment against the train-test domain shift, DynaPURLS incorporates a dynamic refinement module: during inference, it adapts textual features to the incoming visual stream via a lightweight learnable projection. This refinement is stabilized by a confidence-aware, class-balanced memory bank, which mitigates error propagation from noisy pseudo-labels. Extensive experiments on three large-scale benchmarks, NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD, demonstrate that DynaPURLS significantly outperforms prior methods, setting new state-of-the-art results. The source code is publicly available at https://github.com/Alchemist0754/DynaPURLS
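To make the inference-time loop concrete, below is a minimal, self-contained sketch of the two mechanisms the abstract names: a lightweight residual projection that refines class text features against the incoming visual stream, and a confidence-aware, class-balanced memory bank that gates noisy pseudo-labels. This is an illustrative toy on synthetic features, not the paper's implementation; the residual form `t + W t`, the squared-error alignment objective, the FIFO per-class eviction, and every hyperparameter (capacity, confidence threshold, learning rate, temperature) are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

class BalancedMemoryBank:
    """Confidence-aware, class-balanced memory: keeps at most `cap`
    high-confidence visual features per pseudo-class (assumed FIFO eviction)."""
    def __init__(self, num_classes, cap=4, tau=0.5):
        self.bank = {c: [] for c in range(num_classes)}
        self.cap, self.tau = cap, tau

    def maybe_add(self, feat, cls, conf):
        if conf < self.tau:          # reject low-confidence pseudo-labels
            return False
        slot = self.bank[cls]
        slot.append(feat)
        if len(slot) > self.cap:     # per-class budget keeps the bank balanced
            slot.pop(0)
        return True

class DynamicTextRefiner:
    """Lightweight learnable projection, assumed residual: t' = t + W t."""
    def __init__(self, dim, lr=0.05):
        self.W = np.zeros((dim, dim))
        self.lr = lr

    def refine(self, T):
        return l2norm(T + T @ self.W.T)

    def step(self, t, v):
        # One gradient step on ||(t + W t) - v||^2 w.r.t. W,
        # pulling the refined text feature toward the memory anchor v.
        r = (t + self.W @ t) - v
        self.W -= self.lr * 2.0 * np.outer(r, t)

# --- toy test-time stream (synthetic stand-ins for skeleton/text features) ---
d, C = 16, 3
T = l2norm(rng.normal(size=(C, d)))              # class text embeddings
bank = BalancedMemoryBank(num_classes=C)
refiner = DynamicTextRefiner(dim=d)

loss_before = loss_after = 0.0
for _ in range(40):
    c = rng.integers(C)                          # ground truth (unseen at test)
    v = l2norm(T[c] + 0.3 * rng.normal(size=d))  # visual feature near its class
    sims = refiner.refine(T) @ v                 # cosine similarities
    probs = np.exp(5 * sims) / np.exp(5 * sims).sum()
    pred, conf = int(probs.argmax()), float(probs.max())
    if bank.maybe_add(v, pred, conf):
        anchor = np.mean(bank.bank[pred], axis=0)   # class-balanced target
        t = T[pred]
        loss_before += np.sum(((t + refiner.W @ t) - anchor) ** 2)
        refiner.step(t, anchor)
        loss_after += np.sum(((t + refiner.W @ t) - anchor) ** 2)
```

Each accepted sample tightens the text-to-visual alignment (the per-step loss strictly decreases for the updated class), while the capacity cap and confidence threshold prevent a few noisy, over-represented pseudo-classes from dominating the adaptation.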