OZ-TAL: Online Zero-Shot Temporal Action Localization

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work introduces Online Zero-shot Temporal Action Localization (OZ-TAL), a new task: detecting and classifying actions from categories unseen during training, in streaming video, as each action completes. The authors propose a training-free online localization framework built on off-the-shelf vision-language models (VLMs), with added mechanisms that refine visual representations and correct model bias to improve generalization. They establish OZ-TAL benchmarks and representative baselines on THUMOS14 and ActivityNet-1.3, and report that the method substantially outperforms existing state-of-the-art approaches under both offline and online zero-shot settings.

📝 Abstract

Online Temporal Action Localization (On-TAL) aims to detect the occurrence time and category of actions in untrimmed streaming videos immediately upon their completion. Recent advancements in this field focus on developing more sophisticated frameworks, shifting from an Online Action Detection (OAD)-based aggregation paradigm to instance-level understanding. However, existing approaches are typically trained on specific domains and often exhibit limited generalization capabilities when applied to arbitrary videos, particularly in the presence of previously unseen actions. In this paper, we introduce a new task called Online Zero-shot Temporal Action Localization (OZ-TAL), which aims to detect previously unseen actions in an online fashion. Furthermore, we propose a training-free framework that leverages off-the-shelf Vision-Language Models (VLMs) while introducing additional mechanisms to enhance visual representations and mitigate their inherent biases. We establish new benchmarks and representative baselines for OZ-TAL on THUMOS14 and ActivityNet-1.3, and extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches under both offline and online zero-shot settings.
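
To make the task setting concrete, here is a minimal sketch of OAD-style aggregation in a zero-shot, training-free setup. It is not the authors' framework, and it omits the representation refinement and bias mitigation mentioned above. It assumes that per-frame embeddings and per-class text embeddings have already been produced by some VLM and live in the same space. The names `cosine` and `localize_online`, and the threshold value, are hypothetical choices for illustration.

```python
import math
from typing import Dict, Iterable, Iterator, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def localize_online(
    frame_embs: Iterable[Sequence[float]],
    class_embs: Dict[str, Sequence[float]],
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> Iterator[Tuple[int, int, str]]:
    """Yield (start_frame, end_frame, label) as soon as an action ends.

    Frames are consumed one at a time, so no future frame is ever used.
    Class names only need text embeddings, so unseen classes can be added
    without any training.
    """
    current: Optional[Tuple[str, int]] = None  # (label, start_frame)
    last_t = -1
    for t, emb in enumerate(frame_embs):
        last_t = t
        # Zero-shot frame classification: best-matching class text embedding.
        scores = {name: cosine(emb, c) for name, c in class_embs.items()}
        best = max(scores, key=scores.get)
        label = best if scores[best] >= threshold else None  # None = background

        # The running action ended on the previous frame: emit it immediately.
        if current is not None and label != current[0]:
            yield (current[1], t - 1, current[0])
            current = None
        # A new action starts on this frame.
        if current is None and label is not None:
            current = (label, t)

    # The stream ended while an action was still running.
    if current is not None:
        yield (current[1], last_t, current[0])


if __name__ == "__main__":
    classes = {"jump": [1.0, 0.0], "run": [0.0, 1.0]}  # toy text embeddings
    stream = [[1, 1], [1, 0], [0.9, 0.1], [1, 1], [0, 1], [0.1, 1]]
    for instance in localize_online(stream, classes, threshold=0.9):
        print(instance)
```

Grouping consecutive per-frame labels is the simplest form of the aggregation paradigm the abstract contrasts with instance-level understanding; a single misclassified frame splits an instance in two, which is one reason more elaborate designs exist.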

Problem

Research questions and friction points this paper is trying to address.

Online Temporal Action Localization

Zero-shot Learning

Unseen Actions

Streaming Videos

Generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Online Zero-shot Temporal Action Localization

Vision-Language Models

Training-free Framework