Zero-Shot Temporal Action Localization Through Textual Guidance

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets two obstacles in zero-shot temporal action localization: discriminating unseen action categories and depending on labeled training data. It proposes TEGU, a method that requires no training annotations. TEGU combines fine-grained textual descriptions generated by large language models with structured text extracted from video captions to give vision-language models stronger semantic guidance for distinguishing novel actions. On THUMOS14 and ActivityNet-v1.3, TEGU outperforms state-of-the-art zero-shot approaches that do not involve training.

📝 Abstract

Zero-shot temporal action localization (ZS-TAL) consists of classifying and localizing actions in untrimmed videos, where action classes are unseen at training time. Existing work uses Vision and Language Models (VLMs), taking advantage of their strong zero-shot transfer capabilities. Yet, these models face evident challenges with fine-grained action classification, making it difficult to directly use them to distinguish between the presence and absence of an action. Most current methods for ZS-TAL address these challenges by training models on large-scale video datasets, which require annotated data and often result in limited generalization performance. Recently, approaches discarding the use of labeled data have emerged as an alternative. Following this direction, we propose a novel approach, "Textual Guidance for finer localization of actions in videos" (TEGU), that compensates for the lack of supervision from training data by exploiting rich textual information derived from large language models and structured text extracted from captions. This additional linguistic context can improve discrimination by providing richer cues about fine-grained action differences within videos. We validate the effectiveness of the proposed method by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results show that, by exploiting rich textual information for improved action localization, TEGU outperforms state-of-the-art ZS-TAL approaches that do not involve training.
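
The general idea of training-free, text-guided localization can be illustrated with a minimal sketch. This is not TEGU's actual pipeline, which the abstract does not specify. It assumes that per-frame visual embeddings and text embeddings of several descriptions of one action class have already been produced by some vision-language model. The function names, the mean-similarity scoring, and the fixed threshold are all hypothetical choices made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def class_scores(frame_embs, description_embs):
    """Score each frame by its mean similarity to all descriptions of one class."""
    return [
        sum(cosine(f, d) for d in description_embs) / len(description_embs)
        for f in frame_embs
    ]

def localize(scores, threshold):
    """Group consecutive frames scoring >= threshold into (start, end) pairs, end inclusive."""
    segments, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i                      # a new segment opens here
        elif s < threshold and start is not None:
            segments.append((start, i - 1))  # the open segment ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, len(scores) - 1))
    return segments

# Toy 2-D embeddings standing in for real VLM features (assumption).
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
descriptions = [[0.0, 1.0], [0.0, 2.0]]  # two descriptions of the same action class

scores = class_scores(frames, descriptions)
segments = localize(scores, threshold=0.5)
print(segments)  # [(1, 2), (4, 4)]
```

Averaging over several descriptions per class is one simple way richer text could sharpen a per-frame score. A real system would likely need smoothing and a data-dependent threshold rather than the fixed value used here.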

Problem

Research questions and friction points this paper is trying to address.

Zero-Shot Temporal Action Localization

Fine-grained Action Classification

Vision and Language Models

Untrimmed Videos

Textual Guidance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-Shot Temporal Action Localization

Textual Guidance

Vision and Language Models