Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition

📅 2025-11-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Zero-shot skeleton-based action recognition is highly challenging because no skeletal priors exist for unseen classes. Existing "align-then-classify" paradigms suffer from fragile point-to-point alignment and from static, coarse-grained decision boundaries. To address these limitations, the paper proposes Flora, which combines: (1) direction-aware regional semantics, built by tuning each class's text semantics with cues from neighboring classes, and aligned with skeleton features point-to-region under a cross-modal geometric consistency constraint to mitigate semantic mismatch; (2) an open-form flow classifier that uses noise-free flow matching to bridge the distribution gap between semantic and skeleton embeddings; and (3) condition-free contrastive regularization together with token-level velocity prediction, which yield distribution-aware, fine-grained decision boundaries. Experiments on three benchmarks validate the method, which performs particularly well even when trained with only 10% of the seen-class data.

📝 Abstract

Recognizing unseen skeleton action categories remains highly challenging due to the absence of corresponding skeletal priors. Existing approaches generally follow an "align-then-classify" paradigm but face two fundamental issues, i.e., (i) fragile point-to-point alignment arising from imperfect semantics, and (ii) rigid classifiers restricted by static decision boundaries and coarse-grained anchors. To address these issues, we propose a novel method for zero-shot skeleton action recognition, termed $\texttt{$\textbf{Flora}$}$, which builds upon $\textbf{F}$lexib$\textbf{L}$e neighb$\textbf{O}$r-aware semantic attunement and open-form dist$\textbf{R}$ibution-aware flow cl$\textbf{A}$ssifier. Specifically, we flexibly attune textual semantics by incorporating neighboring inter-class contextual cues to form direction-aware regional semantics, coupled with a cross-modal geometric consistency objective that ensures stable and robust point-to-region alignment. Furthermore, we employ noise-free flow matching to bridge the modality distribution gap between semantic and skeleton latent embeddings, while a condition-free contrastive regularization enhances discriminability, leading to a distribution-aware classifier with fine-grained decision boundaries achieved through token-level velocity predictions. Extensive experiments on three benchmark datasets validate the effectiveness of our method, showing particularly impressive performance even when trained with only 10% of the seen data. Code is available at https://github.com/cseeyangchen/Flora.
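
To make the idea of attuning text semantics with neighboring classes more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "neighbors" are the k most cosine-similar class text embeddings and that attunement is a simple convex blend with their mean; the function names `l2_normalize` and `neighbor_aware_semantics`, and the parameters `k` and `alpha`, are hypothetical and do not come from the paper or its repository.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def neighbor_aware_semantics(text_emb, k=2, alpha=0.5):
    """Blend each class embedding with the mean of its k nearest classes.

    Assumption: cosine similarity defines the neighborhood and a convex
    blend stands in for the paper's attunement step.
    """
    z = l2_normalize(text_emb)                  # (C, D) unit-norm class embeddings
    sim = z @ z.T                               # (C, C) cosine similarities
    np.fill_diagonal(sim, -np.inf)              # a class is not its own neighbor
    nbr_idx = np.argsort(-sim, axis=1)[:, :k]   # (C, k) most similar classes
    nbr_mean = z[nbr_idx].mean(axis=1)          # (C, D) neighborhood context
    tuned = l2_normalize((1.0 - alpha) * z + alpha * nbr_mean)
    return tuned, nbr_idx

# Toy example: 5 classes with 16-dimensional text embeddings.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(5, 16))
tuned, nbr_idx = neighbor_aware_semantics(text_emb, k=2, alpha=0.5)
```

With `alpha=0` the sketch returns the original normalized embeddings, so `alpha` controls how far each class semantic is pulled toward its neighborhood. The paper's direction-aware regions and geometric consistency objective are richer than this blend and are not reproduced here.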

Problem

Research questions and friction points this paper is trying to address.

Addresses fragile point-to-point alignment in skeleton action recognition

Solves rigid classifier limitations with static decision boundaries

Bridges modality distribution gap between semantic and skeleton embeddings

Innovation

Methods, ideas, or system contributions that make the work stand out.

Flexible neighbor-aware semantic attunement for alignment

Cross-modal geometric consistency for robust point-to-region alignment

Noise-free flow matching with distribution-aware fine-grained classification
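
As a rough illustration of the flow-matching idea listed above, the sketch below builds training targets for a flow whose source is a semantic embedding rather than Gaussian noise. It assumes the standard straight-line interpolation path and a mean-squared-error velocity loss; the paper's exact path, conditioning, token-level prediction, and contrastive regularization are not reproduced, and the names `flow_matching_pair` and `flow_matching_loss` are hypothetical.

```python
import numpy as np

def flow_matching_pair(z_sem, z_skel, t):
    """Return the point x_t on the straight path and its target velocity.

    Assumption: linear path x_t = (1 - t) * z_sem + t * z_skel, whose
    time derivative is the constant velocity z_skel - z_sem.
    """
    x_t = (1.0 - t) * z_sem + t * z_skel
    v_target = z_skel - z_sem
    return x_t, v_target

def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((v_pred - v_target) ** 2))

# Toy batch: 4 paired semantic / skeleton embeddings of dimension 8.
rng = np.random.default_rng(0)
z_sem = rng.normal(size=(4, 8))     # stand-in for text-side embeddings
z_skel = rng.normal(size=(4, 8))    # stand-in for skeleton-side embeddings
t = rng.uniform(size=(4, 1))        # one time step per sample

x_t, v_target = flow_matching_pair(z_sem, z_skel, t)

# A real model would predict the velocity from (x_t, t); an oracle that
# returns the exact target gives zero loss.
loss = flow_matching_loss(v_target, v_target)
```

Because the source distribution is the semantic embedding itself, no noise sample is drawn at any point, which is one way to read "noise-free" here. In practice a learned network would replace the oracle predictor.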

🔎 Similar Papers

Self-Supervised Skeleton-Based Action Representation Learning: A Benchmark and Beyond