🤖 AI Summary
Zero-shot skeleton-based action recognition faces two key challenges: insufficient modeling of fine-grained action semantics and a gap between skeleton and textual representations. To address these, the authors propose the Frequency-Semantic Enhanced Variational Autoencoder (FS-VAE), which brings frequency decomposition into skeleton semantic representation learning. It uses high- and low-frequency adjustment modules to enrich the learned skeletal semantics and improve robustness, and it aligns skeletons with multilevel semantic action descriptions that capture both local details and global correspondence. A calibrated cross-alignment loss lets valid skeleton-text pairs counterbalance ambiguous ones, which reduces discrepancies between skeleton and text features. Together, these components compensate for the information inherently lost in skeleton sequences and reduce semantic ambiguity. Evaluations on benchmark datasets show that the approach improves zero-shot recognition and separates visually or semantically similar actions more reliably (e.g., "drinking water" vs. "brushing teeth").
📝 Abstract
Zero-shot skeleton-based action recognition aims to develop models capable of identifying actions beyond the categories encountered during training. Previous approaches have primarily focused on aligning visual and semantic representations but have often overlooked the importance of fine-grained action patterns in the semantic space (e.g., the hand movements in drinking water and brushing teeth). To address this limitation, we propose a Frequency-Semantic Enhanced Variational Autoencoder (FS-VAE) that explores skeleton semantic representation learning through frequency decomposition. FS-VAE consists of three key components: 1) a frequency-based enhancement module with high- and low-frequency adjustments to enrich skeletal semantics learning and improve the robustness of zero-shot action recognition; 2) a semantic-based action description with multilevel alignment to capture both local details and global correspondence, effectively bridging the semantic gap and compensating for the inherent loss of information in skeleton sequences; 3) a calibrated cross-alignment loss that enables valid skeleton-text pairs to counterbalance ambiguous ones, mitigating discrepancies and ambiguities in skeleton and text features and thereby ensuring robust alignment. Evaluations on benchmark datasets demonstrate the effectiveness of our approach and validate that frequency-enhanced semantic features enable robust differentiation of visually and semantically similar action clusters, improving zero-shot action recognition.
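To make the idea of high- and low-frequency adjustment concrete, the following is a minimal sketch and not the authors' module. The abstract does not state which transform, cutoff, or gains FS-VAE uses, so everything here is an assumption for illustration: a real FFT along the time axis, a fixed cutoff ratio, fixed (non-learned) gains, and the hypothetical function name `adjust_frequencies`.

```python
import numpy as np


def adjust_frequencies(seq, cutoff=0.25, low_gain=1.0, high_gain=0.5):
    """Rescale low- and high-frequency temporal components of a skeleton sequence.

    seq: array of shape (T, J, C) = (frames, joints, coordinates).
    cutoff: fraction of frequency bins treated as "low" (assumed value).
    low_gain / high_gain: fixed scaling factors (assumed; a trained model
    would more plausibly learn such adjustments).
    """
    seq = np.asarray(seq, dtype=float)
    n_frames = seq.shape[0]

    # Real FFT along the temporal axis: bin 0 is the mean pose,
    # higher bins correspond to faster motion.
    spectrum = np.fft.rfft(seq, axis=0)
    n_bins = spectrum.shape[0]

    # Per-bin gain vector: low_gain below the cutoff, high_gain above it.
    n_low = max(1, int(round(cutoff * n_bins)))
    gains = np.full(n_bins, high_gain)
    gains[:n_low] = low_gain

    # Broadcast the gains over all non-temporal axes.
    gains = gains.reshape(-1, *([1] * (seq.ndim - 1)))

    # Transform back to the time domain with the original length.
    return np.fft.irfft(spectrum * gains, n=n_frames, axis=0)


# Example: 64 frames, 25 joints, 3D coordinates (random stand-in data).
rng = np.random.default_rng(0)
skeleton = rng.normal(size=(64, 25, 3))
smoothed = adjust_frequencies(skeleton, cutoff=0.25, low_gain=1.0, high_gain=0.5)
print(smoothed.shape)
```

In this sketch, damping the high band (`high_gain < 1`) suppresses frame-to-frame jitter, while raising it would emphasize fast, fine-grained motion such as small hand movements. How FS-VAE actually balances the two bands is not described in the abstract.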