SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work tackles a central challenge in zero-shot skeleton-based action recognition: the absence of contextual cues, such as interacting objects, hinders alignment between skeletal and semantic representations and makes visually similar actions hard to distinguish. To overcome this limitation, the authors propose SkeletonContext, a framework built around a language model–guided reconstruction of masked contextual prompts. By combining cross-modal prompt learning with disentangled motion features of key joints, the method achieves fine-grained action discrimination without requiring explicit object information. Experiments on multiple benchmark datasets show that SkeletonContext outperforms existing approaches and reaches state-of-the-art performance under both conventional and generalized zero-shot settings, improving both semantic alignment and recognition robustness.
📝 Abstract
Zero-shot skeleton-based action recognition aims to recognize unseen actions by transferring knowledge from seen categories through semantic descriptions. Most existing methods typically align skeleton features with textual embeddings within a shared latent space. However, the absence of contextual cues, such as objects involved in the action, introduces an inherent gap between skeleton and semantic representations, making it difficult to distinguish visually similar actions. To address this, we propose SkeletonContext, a prompt-based framework that enriches skeletal motion representations with language-driven contextual semantics. Specifically, we introduce a Cross-Modal Context Prompt Module, which leverages a pretrained language model to reconstruct masked contextual prompts under guidance derived from LLMs. This design effectively transfers linguistic context to the skeleton encoder for instance-level semantic grounding and improved cross-modal alignment. In addition, a Key-Part Decoupling Module is incorporated to decouple motion-relevant joint features, ensuring robust action understanding even in the absence of explicit object interactions. Extensive experiments on multiple benchmarks demonstrate that SkeletonContext achieves state-of-the-art performance under both conventional and generalized zero-shot settings, validating its effectiveness in reasoning about context and distinguishing fine-grained, visually similar actions.
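The abstract describes aligning skeleton features with textual embeddings in a shared latent space and recognizing unseen actions through semantic descriptions. As a minimal, hypothetical sketch (not the authors' implementation), the core zero-shot decision rule — assign a skeleton embedding to the unseen class whose text embedding is closest by cosine similarity — can be written as:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project vectors onto the unit sphere so that dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(skeleton_feat, class_text_feats):
    """Return the index of the unseen class whose text embedding is most
    cosine-similar to the skeleton embedding in the shared latent space.

    skeleton_feat:    (D,) embedding from a skeleton encoder (assumed given)
    class_text_feats: (C, D) text embeddings of the unseen class descriptions
    """
    s = l2_normalize(skeleton_feat)
    t = l2_normalize(class_text_feats)
    sims = t @ s  # cosine similarity to each class description
    return int(np.argmax(sims))

# Toy example with made-up 4-dim embeddings for 3 unseen classes.
text_feats = np.array([[1.0, 0.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0, 0.0]])
# A skeleton embedding that lies close to class 1's description.
skel_feat = np.array([0.10, 0.90, 0.05, 0.00])
pred = zero_shot_classify(skel_feat, text_feats)  # → 1
```

In a real system the skeleton encoder and text encoder would be trained on seen categories so that matched pairs land close in this space; the paper's contribution is enriching the skeleton side with language-driven contextual semantics before this alignment step.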
Problem

Research questions and friction points this paper is trying to address.

zero-shot action recognition
skeleton-based action recognition
contextual cues
cross-modal alignment
visually similar actions
Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt learning
zero-shot action recognition
skeleton-based representation
cross-modal alignment
contextual semantics