🤖 AI Summary
Zero-shot skeleton-based action recognition is challenging because unseen classes have no skeletal priors. Existing "align-then-classify" methods suffer from fragile point-to-point alignment and from classifiers limited by static decision boundaries and coarse-grained anchors. To address these limitations, we propose Flora, which combines: (1) neighbor-aware semantic attunement, which uses neighboring inter-class context to form direction-aware regional semantics and, together with a cross-modal geometric consistency objective, gives stable point-to-region alignment; (2) an open-form, distribution-aware flow classifier, in which noise-free flow matching bridges the distribution gap between semantic and skeleton latent embeddings; and (3) condition-free contrastive regularization and token-level velocity prediction, which improve discriminability and yield fine-grained decision boundaries. Experiments on three benchmark datasets validate the method, with particularly strong performance even when it is trained on only 10% of the seen data.
📝 Abstract
Recognizing unseen skeleton action categories remains highly challenging due to the absence of corresponding skeletal priors. Existing approaches generally follow an "align-then-classify" paradigm but face two fundamental issues: (i) fragile point-to-point alignment arising from imperfect semantics, and (ii) rigid classifiers restricted by static decision boundaries and coarse-grained anchors. To address these issues, we propose a novel method for zero-shot skeleton action recognition, termed $\texttt{$\textbf{Flora}$}$, which builds upon $\textbf{F}$lexib$\textbf{L}$e neighb$\textbf{O}$r-aware semantic attunement and an open-form dist$\textbf{R}$ibution-aware flow cl$\textbf{A}$ssifier. Specifically, we flexibly attune textual semantics by incorporating neighboring inter-class contextual cues to form direction-aware regional semantics, coupled with a cross-modal geometric consistency objective that ensures stable and robust point-to-region alignment. Furthermore, we employ noise-free flow matching to bridge the modality distribution gap between semantic and skeleton latent embeddings, while a condition-free contrastive regularization enhances discriminability, leading to a distribution-aware classifier with fine-grained decision boundaries achieved through token-level velocity predictions. Extensive experiments on three benchmark datasets validate the effectiveness of our method, showing particularly impressive performance even when trained with only 10% of the seen data. Code is available at https://github.com/cseeyangchen/Flora.
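To make the flow-matching idea in the abstract more concrete, here is a minimal NumPy sketch of generic straight-line flow matching between two sets of token embeddings. It is not taken from the Flora code base. The function names, the tensor shapes, the linear interpolation path, the mean-squared-error objective, and the choice of which modality serves as the source are all assumptions made for illustration. "Noise-free" is read here as starting the flow from a data embedding rather than from Gaussian noise.

```python
import numpy as np

def flow_matching_pairs(src, tgt, rng):
    """Build (x_t, t, v_target) training triples on a straight-line path.

    src, tgt: arrays of shape (batch, tokens, dim). Assumption: src is an
    embedding from one modality and replaces the Gaussian noise sample used
    in standard flow matching; tgt is the paired embedding from the other.
    """
    batch = src.shape[0]
    # One interpolation time per sample, broadcast over tokens and channels.
    t = rng.uniform(0.0, 1.0, size=(batch, 1, 1))
    x_t = (1.0 - t) * src + t * tgt   # point on the straight path at time t
    v_target = tgt - src              # constant velocity along that path
    return x_t, t, v_target

def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target per-token velocities."""
    return float(np.mean((v_pred - v_target) ** 2))

def euler_transport(src, velocity_fn, steps=10):
    """Move src along a velocity field with fixed-step Euler integration.

    velocity_fn(x, t) is a hypothetical stand-in for a trained network that
    returns one velocity vector per token.
    """
    x = src.copy()
    dt = 1.0 / steps
    for k in range(steps):
        t = np.full((x.shape[0], 1, 1), k * dt)
        x = x + dt * velocity_fn(x, t)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.normal(size=(4, 3, 8))  # stand-in source token embeddings
    tgt = rng.normal(size=(4, 3, 8))  # stand-in target token embeddings
    x_t, t, v_target = flow_matching_pairs(src, tgt, rng)
    # A zero predictor gives a positive loss; a perfect predictor gives zero.
    print(flow_matching_loss(np.zeros_like(v_target), v_target))
    print(flow_matching_loss(v_target, v_target))
```

In a classifier built on this idea, the velocity predictor would be a learned network, and a class could be scored by how well the transported embedding or the predicted velocities agree with each candidate class. That scoring rule is likewise an assumption here; the abstract does not specify it.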