CEZSAR: A Contrastive Embedding Method for Zero-Shot Action Recognition

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the semantic gap and domain shift challenges in zero-shot action recognition by proposing a contrastive learning–based multimodal joint embedding approach. It constructs a unified semantic space between videos and natural language descriptions and introduces an automatic negative sampling mechanism to generate unpaired data for enhanced training, thereby effectively aligning cross-modal representations and mitigating distributional shift. By integrating a video encoder—such as a Transformer—with a pretrained language model, the method achieves state-of-the-art performance across multiple zero-shot splits of UCF-101 and Kinetics-400, significantly improving recognition accuracy for unseen action categories.
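As a rough illustration of why a unified semantic space enables recognition of unseen categories, the sketch below labels a video by finding the closest class description in that space. It is a minimal sketch under stated assumptions, not the authors' implementation: the three-dimensional vectors are made-up toy values standing in for encoder outputs, and the names `cosine`, `predict_unseen`, `label_embs`, and `video_emb` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_unseen(video_emb, label_embs):
    """Return the class whose text embedding is most similar to the video embedding."""
    return max(label_embs, key=lambda name: cosine(video_emb, label_embs[name]))


# Toy embeddings (assumed values). In practice these would come from a
# pretrained language model applied to descriptions of the unseen classes.
label_embs = {
    "archery": [1.0, 0.0, 0.0],
    "surfing": [0.0, 1.0, 0.0],
    "juggling": [0.0, 0.0, 1.0],
}

# Toy video embedding (assumed values), standing in for the video encoder output.
video_emb = [0.1, 0.9, 0.2]

prediction = predict_unseen(video_emb, label_embs)
print(prediction)
```

Because classification reduces to a similarity lookup, adding a new action class only requires embedding its description; no video of that class is needed at training time.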

📝 Abstract

This paper proposes a novel Zero-Shot Action Recognition (ZSAR) method based on contrastive learning. In ZSAR, we aim to classify examples from classes that were missing during training. Two well-known problems remain in ZSAR: the semantic gap and the domain shift. The semantic gap occurs because label representations come from the textual domain (e.g., language models) and must be associated with visual representations (e.g., CNNs, RNNs, or transformer-based models). This multimodal nature implies that the semantic properties of the two spaces are not identical. The domain shift, on the other hand, arises from differences between the training and test sets and is inherent to ZSAR, since the test set is unknown. One of the most promising ways to address both issues is to learn joint embedding spaces. Therefore, we propose a new model that encodes videos and sentences in a joint embedding space, trained by aligning videos with their natural-language descriptions. We design an automatic negative sampling procedure that augments the training dataset with unpaired data, i.e., visual appearance paired with unrelated descriptions. Our results are state-of-the-art on the UCF-101 and Kinetics-400 datasets under several split configurations. Our code is available at https://github.com/valterlej/cezsar.
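The negative sampling idea in the abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure from the paper or its repository: the function names `sample_negatives` and `contrastive_loss`, the toy descriptions, and the margin value are hypothetical, and the loss shown is the classic margin-based contrastive loss used as a stand-in, since the abstract does not specify the exact objective.

```python
import random


def sample_negatives(pairs, rng):
    """Pair each video with a description drawn from a different class.

    `pairs` holds (video_id, class_name, description) tuples.
    Returns (video_id, description, 0) tuples, where 0 marks an unpaired example.
    """
    negatives = []
    for video, label, _ in pairs:
        candidates = [desc for _, other, desc in pairs if other != label]
        negatives.append((video, rng.choice(candidates), 0))
    return negatives


def contrastive_loss(distance, is_pair, margin=1.0):
    """Margin-based contrastive loss on an embedding distance (assumed objective).

    Paired examples are pulled together; unpaired examples are pushed
    apart until they are at least `margin` away from each other.
    """
    if is_pair:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


# Toy paired data (assumed values): each video has one matching description.
pairs = [
    ("v1", "archery", "a person shoots an arrow at a target"),
    ("v2", "surfing", "a person rides a wave on a board"),
    ("v3", "juggling", "a person keeps several balls in the air"),
]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
negatives = sample_negatives(pairs, rng)

# Training set: original pairs (label 1) plus sampled unpaired examples (label 0).
training = [(video, desc, 1) for video, _, desc in pairs] + negatives
print(len(training))
```

Sampling negatives from other classes doubles the toy training set without any extra annotation, which is the sense in which the procedure augments the data.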

Problem

Research questions and friction points this paper is trying to address.

Zero-Shot Action Recognition

Semantic Gap

Domain Shift

Multimodal Embedding

Unseen Class Recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive Learning

Zero-Shot Action Recognition

Joint Embedding Space