🤖 AI Summary
This work addresses a limitation of conventional temporal action segmentation methods, which rely on closed vocabularies and struggle to generalize to unseen action categories, by introducing and systematically investigating the task of open-vocabulary zero-shot temporal action segmentation. The proposed approach requires no training: it leverages vision-language models (VLMs) to compute Frame-Action Embedding Similarity (FAES), then applies Similarity-Matrix Temporal Segmentation (SMTS) to delineate action boundaries. The authors evaluate 14 VLMs on standard benchmarks, demonstrating that the framework achieves high-quality segmentation without any task-specific supervision and validating the potential of VLMs for structured temporal understanding.
📝 Abstract
Temporal Action Segmentation (TAS) requires dividing videos into action segments, yet the vast space of activities and alternative breakdowns makes collecting comprehensive datasets infeasible. Existing methods remain limited to closed vocabularies and fixed label sets. In this work, we explore the largely unexplored problem of Open-Vocabulary Zero-Shot Temporal Action Segmentation (OVTAS) by leveraging the strong zero-shot capabilities of Vision-Language Models (VLMs). We introduce a training-free pipeline that follows a segmentation-by-classification design: Frame-Action Embedding Similarity (FAES) matches video frames to candidate action labels, and Similarity-Matrix Temporal Segmentation (SMTS) enforces temporal consistency. Beyond proposing OVTAS, we present a systematic study across 14 diverse VLMs, providing the first broad analysis of their suitability for open-vocabulary action segmentation. Experiments on standard benchmarks show that OVTAS achieves strong results without task-specific supervision, underscoring the potential of VLMs for structured temporal understanding.
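The two-stage pipeline can be illustrated with a minimal sketch. The paper's exact FAES and SMTS procedures are not specified here, so the code below makes simplifying assumptions: FAES is modeled as cosine similarity between precomputed frame and label embeddings, and SMTS is stood in for by a per-frame argmax followed by a sliding-window mode filter that enforces temporal consistency before merging runs into segments. The function names and the smoothing strategy are illustrative, not the authors' implementation.

```python
import numpy as np

def faes(frame_emb, label_emb):
    """Frame-Action Embedding Similarity (assumed form): cosine
    similarity between each frame embedding and each candidate
    action-label embedding, giving a (T, K) similarity matrix."""
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    a = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    return f @ a.T

def smts(sim, window=3):
    """Simplified stand-in for Similarity-Matrix Temporal Segmentation:
    take the per-frame best label, smooth with a sliding-window mode
    filter for temporal consistency, then merge consecutive identical
    labels into (start_frame, end_frame, label_index) segments."""
    labels = sim.argmax(axis=1)
    half = window // 2
    smoothed = np.array([
        np.bincount(labels[max(0, t - half):t + half + 1]).argmax()
        for t in range(len(labels))
    ])
    segments, start = [], 0
    for t in range(1, len(smoothed) + 1):
        if t == len(smoothed) or smoothed[t] != smoothed[start]:
            segments.append((start, t - 1, int(smoothed[start])))
            start = t
    return segments

# Toy example: 10 frames, 2 candidate actions; frame 2 is a noisy
# outlier that the mode filter corrects, yielding two clean segments.
frame_emb = np.eye(4)[[0, 0, 1, 0, 0, 1, 1, 1, 1, 1]].astype(float)
label_emb = np.eye(4)[[0, 1]].astype(float)
print(smts(faes(frame_emb, label_emb)))  # [(0, 4, 0), (5, 9, 1)]
```

In practice the frame and label embeddings would come from the VLM's image and text encoders, and the smoothing window trades off boundary precision against robustness to per-frame misclassifications.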