AI Summary
This work addresses the fragile skeleton-text alignment in zero-shot skeleton-based action recognition, which stems from ambiguous action semantics and confusion among unseen classes. To this end, the authors propose a class-conditional energy ranking framework that leverages a text-conditional variational autoencoder with frozen textual representations to parameterize both the prior and decoder, enabling likelihood evaluation without generating samples at test time. A semantic- and confidence-aware listwise energy loss is introduced, dynamically adjusting decision boundaries based on posterior uncertainty. Furthermore, a latent prototype contrastive objective enhances semantic organization and class separability without requiring explicit feature alignment. The method achieves consistent gains over existing VAE- and alignment-based approaches on NTU-60 and NTU-120 while remaining competitive with diffusion model-based solutions.
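To make the scoring idea concrete, here is a minimal numpy sketch of class-conditional energy ranking with a text-conditioned prior and decoder. All names, dimensions, and weight matrices (`prior_mu`, `W_prior`, `W_dec`) are illustrative assumptions, not the paper's actual parameterization; the point is only that an unseen class can be scored by evaluating a likelihood-style energy at the prior mean, with no sampling at test time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: skeleton feature dim D, latent dim d, C unseen classes.
D, d, C = 16, 8, 5

# Frozen text embeddings for the C class names (placeholder random vectors).
text_emb = rng.normal(size=(C, d))

# Text-conditioned prior: the class-conditional Gaussian mean is a (learned)
# linear map of the frozen text embedding; here the map is a fixed random matrix.
W_prior = rng.normal(size=(d, d))
prior_mu = text_emb @ W_prior          # (C, d) class-conditional prior means

# Text-conditioned decoder: maps a latent vector back to skeleton-feature space.
W_dec = rng.normal(size=(D, d)) * 0.3

def class_energy(x, c):
    """Energy of skeleton feature x under class c, evaluated at the prior mean
    (so no samples are generated). Squared reconstruction error plus a small
    prior penalty standing in for the KL term of the ELBO (weight illustrative)."""
    z = prior_mu[c]
    recon = W_dec @ z
    return np.sum((x - recon) ** 2) + 0.01 * np.sum(z ** 2)

def predict(x):
    """Zero-shot prediction: the class with the lowest energy wins."""
    energies = np.array([class_energy(x, c) for c in range(C)])
    return int(np.argmin(energies)), energies

# A skeleton feature synthesized from class 2's decoder output plus noise
# should be ranked closest to class 2.
x = W_dec @ prior_mu[2] + 0.01 * rng.normal(size=D)
pred, energies = predict(x)
```

The key design point mirrored here is that both the prior and the decoder are conditioned on frozen text, so unseen classes are scored purely through their text embeddings.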
Abstract
Zero-shot skeleton-based action recognition (ZSAR) aims to recognize action classes without any training skeletons from those classes, relying instead on auxiliary semantics from text. Existing approaches frequently depend on explicit skeleton-text alignment, which can be brittle when action names underspecify fine-grained dynamics and when unseen classes are semantically confusable. We propose SCALE, a lightweight and deterministic Semantic- and Confidence-Aware Listwise Energy-based framework that formulates ZSAR as class-conditional energy ranking. SCALE builds a text-conditioned conditional variational autoencoder (CVAE) in which frozen text representations parameterize both the latent prior and the decoder, enabling likelihood-based evaluation for unseen classes without generating samples at test time. To separate competing hypotheses, we introduce a semantic- and confidence-aware listwise energy loss that emphasizes semantically similar hard negatives and incorporates posterior uncertainty to adapt decision margins and reweight ambiguous training instances. Additionally, we employ a latent prototype contrastive objective that aligns posterior means with text-derived latent prototypes, improving semantic organization and class separability without direct feature matching. Experiments on the NTU-60 and NTU-120 datasets show that SCALE consistently improves over prior VAE- and alignment-based baselines while remaining competitive with diffusion-based methods.
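The listwise loss described above can be sketched as a softmax over negative energies with two modifications: semantically similar negatives are up-weighted, and the margin grows with posterior uncertainty. This is a minimal illustration under assumed inputs (`energies`, `sem_sim`, `uncertainty` are placeholders), not the paper's exact formulation.

```python
import numpy as np

C = 4                                      # number of classes in the candidate list
# Hypothetical inputs for one sample: per-class energies, the true class y,
# text-embedding similarity of each class to the true class, and a
# posterior-uncertainty score in [0, 1].
energies = np.array([1.0, 3.0, 2.5, 4.0])
y = 0
sem_sim = np.array([1.0, 0.9, 0.4, 0.1])   # class 1 is a semantically hard negative
uncertainty = 0.3

def listwise_energy_loss(energies, y, sem_sim, uncertainty, base_margin=1.0):
    """Listwise softmax over negative energies. Semantic weights emphasize hard
    negatives, and the margin scales with posterior uncertainty so ambiguous
    samples are pushed further from confusable classes."""
    margin = base_margin * (1.0 + uncertainty)
    idx = np.arange(len(energies))
    # Negatives receive an additive margin on their (negated) energies.
    logits = -energies + margin * (idx != y)
    # The true class keeps weight 1; negatives are weighted by semantic similarity.
    w = np.where(idx == y, 1.0, sem_sim)
    exp = w * np.exp(logits)
    return -np.log(exp[y] / exp.sum())

loss = listwise_energy_loss(energies, y, sem_sim, uncertainty)
```

Note the two knobs: raising `uncertainty` widens the margin and increases the loss for the same energies, while down-weighting semantically distant negatives (small `sem_sim`) lets the loss focus its gradient on confusable classes.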