Motion-Guided Semantic Alignment with Negative Prompts for Zero-Shot Video Action Recognition

📅 2026-04-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the semantic gap between seen and unseen action categories in zero-shot video action recognition with a CLIP-based framework that combines feature disentanglement with semantic-guided alignment. A Motion Separation Module (MSM) splits video features into motion-sensitive and global-static components, and a Motion Aggregation Block (MAB) uses gated cross-attention to fuse motion-related information. The method is reported to be the first to pair positive and negative textual prompts to explicitly model "non-class" semantics, which strengthens cross-modal alignment. On standard benchmarks, it consistently outperforms prior CLIP-based methods and generalizes zero-shot across both coarse- and fine-grained datasets.
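
To make the gated cross-attention idea concrete, here is a minimal pure-Python sketch. It is not the paper's MAB: the function names, the tanh-bounded scalar gate, the absence of learned projections, and the choice of motion tokens as queries over context tokens are all assumptions made for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Single-head scaled dot-product attention without learned projections
    # (an assumption; a real block would project queries, keys, and values).
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def gated_cross_attention(motion, context, gate):
    # Hypothetical gating: the attended context is added to the motion tokens
    # through a tanh-bounded scalar gate, so gate = 0 leaves motion untouched.
    attended = cross_attention(motion, context, context)
    g = math.tanh(gate)
    return [[m + g * a for m, a in zip(m_row, a_row)]
            for m_row, a_row in zip(motion, attended)]

# Toy 2-d tokens, chosen only for illustration.
motion_tokens = [[1.0, 0.0], [0.0, 1.0]]
context_tokens = [[0.5, 0.5], [1.0, -1.0]]
closed = gated_cross_attention(motion_tokens, context_tokens, gate=0.0)
opened = gated_cross_attention(motion_tokens, context_tokens, gate=1.0)
```

With the gate at zero the motion tokens pass through unchanged, which is one simple way a gate can limit how much non-motion information is mixed back in.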

📝 Abstract

Zero-shot action recognition is challenging due to the semantic gap between seen and unseen classes. We present a novel framework that enhances CLIP with disentangled embeddings and semantic-guided interaction. A Motion Separation Module (MSM) separates motion-sensitive and global-static features, while a Motion Aggregation Block (MAB) employs gated cross-attention to refine motion representation without re-coupling redundant information. To facilitate generalization to unseen categories, we enforce semantic alignment between video features and textual representations by aligning projected embeddings with positive textual prompts, while leveraging negative prompts to explicitly model "non-class" semantics. Experiments on standard benchmarks demonstrate that our method consistently outperforms prior CLIP-based approaches, achieving robust zero-shot action recognition across both coarse and fine-grained datasets.
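
One way positive and negative prompts could be combined at inference time is sketched below. The abstract does not give the scoring rule, so the difference of cosine similarities, the temperature value, and every name here (`class_probs`, `pos_embs`, `neg_embs`) are assumptions, and the toy vectors stand in for real CLIP embeddings.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def class_probs(video_emb, pos_embs, neg_embs, temperature=0.07):
    # Assumed scoring rule: reward similarity to a class's positive prompt
    # and penalize similarity to its negative ("non-class") prompt.
    logits = [
        (cosine(video_emb, p) - cosine(video_emb, n)) / temperature
        for p, n in zip(pos_embs, neg_embs)
    ]
    return softmax(logits)

# Toy 3-d embeddings for two classes, chosen only for illustration.
video = [0.9, 0.1, 0.0]
pos = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
neg = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
probs = class_probs(video, pos, neg)
```

In this toy setup the video embedding lies close to the first class's positive prompt and far from its negative prompt, so the first class receives almost all of the probability mass.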

Problem

Research questions and friction points this paper is trying to address.

zero-shot action recognition

semantic gap

unseen classes

video action recognition

semantic alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Motion Separation Module

Motion Aggregation Block

Negative Prompts