🤖 AI Summary
In zero-shot skeleton-based action recognition, inaccurate alignment between visual features and semantic vectors, together with the limited robustness and discriminability of cross-modal embedding spaces, remain critical challenges. To address these issues, we propose a dual vision–text alignment framework. Our approach introduces a dual-path mechanism combining direct alignment with augmented alignment, and incorporates a cross-attention-based Semantic Description Enhancement (SDE) module to jointly model skeletal motion dynamics and action semantics, effectively bridging the modality gap. By integrating a visual projector, deep metric learning, skeleton sequence modeling, and joint optimization of multimodal embeddings, our method achieves state-of-the-art performance on multiple mainstream zero-shot benchmarks, improving both generalized recognition accuracy and cross-domain robustness.
📝 Abstract
Zero-shot action recognition, which addresses the scalability and generalization issues of action recognition and allows models to adapt dynamically to new, unseen actions, is an important research topic in the computer vision community. The key to zero-shot action recognition lies in aligning visual features with the semantic vectors representing action categories. Most existing methods either directly project visual features onto the semantic space of text categories or learn a shared embedding space between the two modalities. However, a direct projection cannot accurately align the two modalities, and learning a robust and discriminative embedding space between visual and text representations is often difficult. To address these issues, we introduce Dual Visual-Text Alignment (DVTA) for skeleton-based zero-shot action recognition. DVTA consists of two alignment modules, Direct Alignment (DA) and Augmented Alignment (AA), along with a specially designed Semantic Description Enhancement (SDE) module. The DA module maps skeleton features to the semantic space through a dedicated visual projector, followed by the SDE, which uses cross-attention to strengthen the connection between skeleton and text, thereby reducing the gap between modalities. The AA module further strengthens the embedding space by using deep metric learning to learn the similarity between skeleton and text. Our approach achieves state-of-the-art performance on several popular zero-shot skeleton-based action recognition benchmarks.
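To make the three components concrete, here is a minimal NumPy sketch of the general idea: a direct projection of pooled skeleton features into the text space (DA-style), a cross-attention step in which the text embedding queries the skeleton frames (SDE-style), and a cosine similarity in a shared embedding space (AA-style). All dimensions, weight matrices, and variable names are hypothetical illustrations, not the authors' implementation; the real DVTA modules are learned networks trained with metric-learning losses.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): frames, skeleton feature dim,
# text embedding dim, shared embedding dim.
T, d_skel, d_text, d_emb = 8, 16, 12, 10

skel = rng.normal(size=(T, d_skel))   # per-frame skeleton features
text = rng.normal(size=(d_text,))     # class-description embedding

# --- DA-style: project pooled skeleton features into the text space ---
W_proj = rng.normal(size=(d_skel, d_text)) * 0.1   # stands in for the visual projector
skel_in_text = skel.mean(axis=0) @ W_proj
da_score = cosine(skel_in_text, text)

# --- SDE-style: cross-attention, text query attends over skeleton frames ---
W_q = rng.normal(size=(d_text, d_emb)) * 0.1
W_k = rng.normal(size=(d_skel, d_emb)) * 0.1
W_v = rng.normal(size=(d_skel, d_emb)) * 0.1
q = text @ W_q                             # (d_emb,)
k, v = skel @ W_k, skel @ W_v              # (T, d_emb) each
attn = softmax(k @ q / np.sqrt(d_emb))     # attention weights over frames
enhanced_text = attn @ v                   # skeleton-aware text embedding

# --- AA-style: similarity in a shared embedding space ---
W_s = rng.normal(size=(d_skel, d_emb)) * 0.1
skel_emb = skel.mean(axis=0) @ W_s
aa_score = cosine(skel_emb, enhanced_text)
```

In the actual method, `da_score`- and `aa_score`-like similarities against every class description would be combined to score unseen categories, with the projector, attention, and embedding weights trained jointly rather than sampled randomly.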