Zero-Shot Temporal Action Localization Through Textual Guidance

📅 2026-05-21
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses two challenges in zero-shot temporal action localization: discriminating unseen action categories and reliance on labeled data. It proposes TEGU, a method that requires no training annotations. TEGU leverages fine-grained textual descriptions generated by large language models and structured semantic cues extracted from video captions to provide semantic guidance for vision-language models, improving their ability to distinguish novel actions. Experiments on THUMOS14 and ActivityNet-v1.3 show that TEGU outperforms existing state-of-the-art zero-shot approaches that do not involve training.
📝 Abstract
Zero-shot temporal action localization (ZS-TAL) consists of classifying and localizing actions in untrimmed videos, where the action classes are unseen at training time. Existing work uses Vision and Language Models (VLMs), taking advantage of their strong zero-shot transfer capabilities. Yet these models struggle with fine-grained action classification, which makes it difficult to use them directly to distinguish between the presence and absence of an action. Most current ZS-TAL methods address this by training models on large-scale video datasets, which requires annotated data and often results in limited generalization. Recently, approaches that discard labeled data have emerged as an alternative. Following this direction, we propose a novel approach, "Textual Guidance for finer localization of actions in videos" (TEGU), that compensates for the lack of supervision from training data by exploiting rich textual information derived from large language models and structured text extracted from captions. This additional linguistic context can improve fine-grained discrimination by providing richer cues about subtle differences between actions within videos. We validate the effectiveness of the proposed method through experiments on the THUMOS14 and ActivityNet-v1.3 datasets. Our results show that, by exploiting rich textual information for improved action localization, TEGU outperforms state-of-the-art ZS-TAL approaches that do not involve training.
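To make the training-free setting concrete, here is a minimal sketch of how frame-level similarity to text descriptions could be turned into temporal segments. It is not TEGU's pipeline, which the abstract does not specify. It assumes frame and text embeddings have already been produced by some VLM, and the function names, the toy 2-D embeddings, and the fixed threshold are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def frame_scores(frame_embs, text_embs):
    """Score each frame by its best-matching text description (assumed aggregation rule)."""
    return [max(cosine(f, t) for t in text_embs) for f in frame_embs]


def segments_from_scores(scores, threshold):
    """Group consecutive frames scoring >= threshold into inclusive (start, end) index pairs."""
    segments = []
    start = None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i  # a new segment opens here
        elif s < threshold and start is not None:
            segments.append((start, i - 1))  # segment closed by a low-scoring frame
            start = None
    if start is not None:
        segments.append((start, len(scores) - 1))  # segment runs to the last frame
    return segments


# Toy stand-ins for VLM outputs: five frame embeddings and one description embedding.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.1]]
descriptions = [[0.0, 1.0]]

scores = frame_scores(frames, descriptions)
print(segments_from_scores(scores, threshold=0.8))  # prints [(2, 3)]
```

In this toy case only frames 2 and 3 align with the description, so a single segment is returned. Supplying several descriptions per action class is one plausible way richer text could sharpen the per-frame scores, since each frame keeps its best match.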
Problem

Research questions and friction points this paper is trying to address.

Zero-Shot Temporal Action Localization
Fine-grained Action Classification
Vision and Language Models
Untrimmed Videos
Textual Guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-Shot Temporal Action Localization
Textual Guidance
Vision and Language Models
Fine-grained Action Discrimination
Unsupervised Video Understanding