Training-Free Zero-Shot Temporal Action Detection with Vision-Language Models

📅 2025-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing zero-shot temporal action detection methods are sensitive to domain shift and carry high training costs, limiting their practicality. This paper introduces the first training-free zero-shot temporal action detection framework, which directly leverages off-the-shelf vision-language (ViL) models to localize and classify unseen actions in untrimmed videos, eliminating domain shift and training costs entirely. Key contributions include: (1) a training-free paradigm; (2) the LOGarithmic decay weighted Outer-Inner-Contrastive Score (LogOIC); (3) frequency-based Actionness Calibration; and (4) Prototype-Centric Sampling (PCS), a test-time adaptive strategy. The method substantially outperforms the unsupervised state of the art on THUMOS14 and ActivityNet-1.3 while running 13× faster at inference. When augmented with test-time adaptation (TTA), its performance approaches that of fully supervised methods.
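The LogOIC score named above builds on the Outer-Inner-Contrastive idea: a good action proposal has high ViL similarity to the action prompt inside its boundaries and low similarity just outside them. Below is a minimal sketch of that scoring, assuming a per-frame similarity signal; the decay function, the 25% boundary inflation, and the names `log_decay_weights` and `log_oic_score` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def log_decay_weights(length: int) -> np.ndarray:
    """Weights that decay logarithmically from the segment center toward
    its boundaries (assumed form of the paper's logarithmic decay)."""
    center = (length - 1) / 2.0
    dist = np.abs(np.arange(length) - center)
    return 1.0 / np.log2(dist + 2.0)  # in (0, 1], largest at the center

def log_oic_score(frame_scores: np.ndarray, start: int, end: int) -> float:
    """Contrast the decay-weighted mean ViL similarity inside [start, end)
    against the mean similarity in an inflated region just outside it."""
    inner = frame_scores[start:end]
    w = log_decay_weights(len(inner))
    inner_mean = float(np.sum(w * inner) / np.sum(w))

    margin = max(1, (end - start) // 4)  # assumed 25% boundary inflation
    lo, hi = max(0, start - margin), min(len(frame_scores), end + margin)
    outer = np.concatenate([frame_scores[lo:start], frame_scores[end:hi]])
    outer_mean = float(outer.mean()) if outer.size else 0.0
    return inner_mean - outer_mean
```

A proposal that tightly covers a high-similarity region scores close to the full inner-outer gap, while a shifted or loose proposal scores lower, which is what lets such a score rank candidate segments without any training.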

📝 Abstract
Existing zero-shot temporal action detection (ZSTAD) methods predominantly use fully supervised or unsupervised strategies to recognize unseen activities. However, these training-based methods are prone to domain shifts and require high computational costs, which hinder their practical applicability in real-world scenarios. In this paper, unlike previous works, we propose a training-Free Zero-shot temporal Action Detection (FreeZAD) method, leveraging existing vision-language (ViL) models to directly classify and localize unseen activities within untrimmed videos without any additional fine-tuning or adaptation. We mitigate the need for explicit temporal modeling and reliance on pseudo-label quality by designing the LOGarithmic decay weighted Outer-Inner-Contrastive Score (LogOIC) and frequency-based Actionness Calibration. Furthermore, we introduce a test-time adaptation (TTA) strategy using Prototype-Centric Sampling (PCS) to expand FreeZAD, enabling ViL models to adapt more effectively for ZSTAD. Extensive experiments on the THUMOS14 and ActivityNet-1.3 datasets demonstrate that our training-free method outperforms state-of-the-art unsupervised methods while requiring only 1/13 of the runtime. When equipped with TTA, the enhanced method further narrows the gap with fully supervised methods.
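The Prototype-Centric Sampling (PCS) strategy used for test-time adaptation can be pictured as selecting the candidate features closest to a prototype built from the most confident predictions, so that only reliable samples drive adaptation. The following is a hedged sketch of that idea; the function name, the top-k prototype construction, and the use of cosine similarity are assumptions, not the paper's exact procedure.

```python
import numpy as np

def prototype_centric_sampling(features: np.ndarray,
                               confidences: np.ndarray,
                               top_k: int = 5,
                               keep: int = 10) -> np.ndarray:
    """Return indices of the `keep` features nearest to a class prototype
    formed from the `top_k` most confident features (assumed PCS variant)."""
    # L2-normalize so dot products are cosine similarities
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)

    # Prototype: mean of the most confident segment embeddings
    top = np.argsort(confidences)[::-1][:top_k]
    proto = feats[top].mean(axis=0)
    proto /= np.linalg.norm(proto)

    # Keep the samples most aligned with the prototype
    sims = feats @ proto
    return np.argsort(sims)[::-1][:keep]
```

Filtering test-time samples this way trades coverage for reliability: outlier segments far from the prototype are excluded, which limits the noisy gradients that typically destabilize test-time adaptation.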
Problem

Research questions and friction points this paper is trying to address.

Zero-shot Action Recognition
Environmental Variability
Computational Cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot Action Detection
Vision-Language Models
Prototype-Centric Sampling
Chaolei Han
Southeast University
Computer Vision · Video Analysis · Action Detection
Hongsong Wang
Southeast University, School of Computer Science and Engineering
Jidong Kuang
Southeast University, School of Cyber Science and Engineering
Lei Zhang
Nanjing Normal University, School of Electrical Engineering and Automation
Jie Gui
Southeast University, China
Pattern Recognition and Machine Learning · Artificial Intelligence · Data Mining · Deep Learning · Image Processing and Computer Vision