🤖 AI Summary
This work addresses the semantic gap between seen and unseen action categories in zero-shot video action recognition by proposing a CLIP-based approach that combines feature disentanglement with semantic-guided alignment. A Motion Separation Module (MSM) splits video features into motion-sensitive and global-static components, and a Motion Aggregation Block (MAB) uses gated cross-attention to fuse motion-related information. Notably, it is the first to incorporate positive–negative textual prompt pairs to explicitly model “non-class” semantics, thereby enhancing cross-modal alignment. Evaluated on multiple standard benchmarks, the approach consistently outperforms existing CLIP-based methods and generalizes zero-shot to both coarse- and fine-grained datasets.
📝 Abstract
Zero-shot action recognition is challenging due to the semantic gap between seen and unseen classes. We present a novel framework that enhances CLIP with disentangled embeddings and semantic-guided interaction. A Motion Separation Module (MSM) separates motion-sensitive and global-static features, while a Motion Aggregation Block (MAB) employs gated cross-attention to refine the motion representation without re-coupling redundant information. To facilitate generalization to unseen categories, we enforce semantic alignment between video features and textual representations by aligning projected embeddings with positive textual prompts, while leveraging negative prompts to explicitly model "non-class" semantics. Experiments on standard benchmarks demonstrate that our method consistently outperforms prior CLIP-based approaches, achieving robust zero-shot action recognition across both coarse- and fine-grained datasets.
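The abstract names two mechanisms without giving formulas: gated cross-attention in the MAB, and alignment against positive and negative prompts. The sketch below shows one plausible reading of each using only the Python standard library. Everything in it is an assumption for illustration rather than the authors' implementation: the function names, the choice of motion tokens as queries over global-static tokens, the scalar tanh gate, the score cos(video, positive) - cos(video, negative), and the temperature of 0.07.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Small epsilon guards against division by zero for all-zero vectors.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-12)

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def gated_cross_attention(motion, static, gate):
    """Hypothetical single-head gated cross-attention (no learned projections).

    motion: list of query vectors; static: list of key/value vectors.
    Returns motion + tanh(gate) * attention(motion -> static), so a gate
    of 0 leaves the motion features unchanged.
    """
    d = len(motion[0])
    g = math.tanh(gate)
    out = []
    for q in motion:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in static])
        ctx = [sum(w * v[i] for w, v in zip(weights, static)) for i in range(d)]
        out.append([qi + g * ci for qi, ci in zip(q, ctx)])
    return out

def class_probs(video, pos_prompts, neg_prompts, temperature=0.07):
    """Assumed scoring rule: reward similarity to the positive prompt of a
    class and penalize similarity to its negative ("non-class") prompt."""
    logits = [(cosine(video, p) - cosine(video, n)) / temperature
              for p, n in zip(pos_prompts, neg_prompts)]
    return softmax(logits)

def alignment_loss(video, pos_prompts, neg_prompts, label, temperature=0.07):
    # Cross-entropy over the positive-minus-negative class scores.
    probs = class_probs(video, pos_prompts, neg_prompts, temperature)
    return -math.log(probs[label])

if __name__ == "__main__":
    motion = [[1.0, 0.0], [0.0, 1.0]]
    static = [[0.5, 0.5]]
    print(gated_cross_attention(motion, static, gate=0.0))  # gate closed
    video = [1.0, 0.0]
    pos = [[1.0, 0.0], [0.0, 1.0]]   # toy positive prompt embeddings
    neg = [[0.0, 1.0], [1.0, 0.0]]   # toy negative prompt embeddings
    print(class_probs(video, pos, neg))
```

In this reading the gate controls how much global-static context flows back into the motion stream, which is one way to refine motion features without fully re-coupling the two. At zero-shot inference the same `class_probs` scoring could be applied to prompts written for unseen class names.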