🤖 AI Summary
This work addresses a limitation of conventional temporal action segmentation methods, which rely on closed vocabularies and struggle to generalize to unseen action categories, by introducing and systematically investigating the task of open-vocabulary zero-shot temporal action segmentation. The proposed approach requires no training: it leverages vision-language models (VLMs) to compute Frame-Action Embedding Similarity (FAES), then applies Similarity-Matrix Temporal Segmentation (SMTS) to delineate action boundaries. The authors evaluate 14 VLMs on standard benchmarks, demonstrating that the framework achieves high-quality segmentation without any task-specific supervision and validating the potential of VLMs for structured temporal understanding.
📝 Abstract
Temporal Action Segmentation (TAS) requires dividing videos into action segments, yet the vast space of activities and alternative breakdowns makes collecting comprehensive datasets infeasible. Existing methods remain limited to closed vocabularies and fixed label sets. In this work, we explore the largely unexplored problem of Open-Vocabulary Zero-Shot Temporal Action Segmentation (OVTAS) by leveraging the strong zero-shot capabilities of Vision-Language Models (VLMs). We introduce a training-free pipeline that follows a segmentation-by-classification design: Frame-Action Embedding Similarity (FAES) matches video frames to candidate action labels, and Similarity-Matrix Temporal Segmentation (SMTS) enforces temporal consistency. Beyond proposing OVTAS, we present a systematic study across 14 diverse VLMs, providing the first broad analysis of their suitability for open-vocabulary action segmentation. Experiments on standard benchmarks show that OVTAS achieves strong results without task-specific supervision, underscoring the potential of VLMs for structured temporal understanding.
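The two-stage pipeline can be illustrated with a minimal sketch. The paper's exact FAES and SMTS procedures are not specified here, so the code below makes simplifying assumptions: FAES is modeled as cosine similarity between precomputed frame and label embeddings, and SMTS is stood in for by a per-frame argmax followed by a sliding-window mode filter that enforces temporal consistency before merging runs into segments. The function names and the smoothing strategy are illustrative, not the authors' implementation.

```python
import numpy as np

def faes(frame_emb, label_emb):
    """Frame-Action Embedding Similarity (assumed form): cosine
    similarity between each frame embedding and each candidate
    action-label embedding, giving a (T, K) similarity matrix."""
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    a = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    return f @ a.T

def smts(sim, window=3):
    """Simplified stand-in for Similarity-Matrix Temporal Segmentation:
    take the per-frame best label, smooth with a sliding-window mode
    filter for temporal consistency, then merge consecutive identical
    labels into (start_frame, end_frame, label_index) segments."""
    labels = sim.argmax(axis=1)
    half = window // 2
    smoothed = np.array([
        np.bincount(labels[max(0, t - half):t + half + 1]).argmax()
        for t in range(len(labels))
    ])
    segments, start = [], 0
    for t in range(1, len(smoothed) + 1):
        if t == len(smoothed) or smoothed[t] != smoothed[start]:
            segments.append((start, t - 1, int(smoothed[start])))
            start = t
    return segments

# Toy example: 10 frames, 2 candidate actions; frame 2 is a noisy
# outlier that the mode filter corrects, yielding two clean segments.
frame_emb = np.eye(4)[[0, 0, 1, 0, 0, 1, 1, 1, 1, 1]].astype(float)
label_emb = np.eye(4)[[0, 1]].astype(float)
print(smts(faes(frame_emb, label_emb)))  # [(0, 4, 0), (5, 9, 1)]
```

In practice the frame and label embeddings would come from the VLM's image and text encoders, and the smoothing window trades off boundary precision against robustness to per-frame misclassifications.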