🤖 AI Summary
This work addresses zero-shot video action localization: identifying the start and end timestamps of arbitrary semantic actions in long videos without labeled data or predefined action categories. We propose a training-free, iterative visual prompting framework that leverages off-the-shelf vision-language models (VLMs) for temporal boundary inference. The core idea is a frame-indexed image concatenation strategy that serves as a structured visual prompt, coupled with a shrinking-window mechanism that progressively refines the localization. The method combines zero-shot VLM reasoning, adaptive frame sampling, structured image composition, and prompt engineering. On standard benchmarks, the approach performs on par with state-of-the-art zero-shot action localization methods, demonstrating the practicality of using off-the-shelf VLMs for fine-grained, temporally precise video understanding without task-specific training.
📝 Abstract
Video action localization aims to find the timings of specific actions in a long video. Although existing learning-based approaches have been successful, they require annotated videos, which incurs considerable labor cost. This paper proposes a training-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLMs). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored to finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames and create a concatenated image with frame index labels, allowing a VLM to identify the frames that most likely correspond to the start and end of the action. By iteratively narrowing the sampling window around the selected frames, the estimate gradually converges to more precise temporal boundaries. We demonstrate that this technique yields reasonable performance, achieving results comparable to state-of-the-art zero-shot action localization. These results support the use of VLMs as a practical tool for understanding videos. Sample code is available at https://microsoft.github.io/VLM-Video-Action-Localization/.
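The shrinking-window loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the VLM call is abstracted as a callable that, given the sampled timestamps (which the real system renders as one index-labelled concatenated image), returns the indices of the frames it judges to bound the action. Function names, the half-step margin, and all default parameters are assumptions for illustration.

```python
from typing import Callable, List, Tuple

def localize_action(
    video_len: float,                               # video duration in seconds
    query_vlm: Callable[[List[float]], Tuple[int, int]],
    n_frames: int = 8,                              # frames sampled per iteration (assumed)
    n_iters: int = 4,                               # number of shrink iterations (assumed)
) -> Tuple[float, float]:
    """Iteratively shrink the sampling window around VLM-selected frames."""
    lo, hi = 0.0, video_len
    start, end = lo, hi
    for _ in range(n_iters):
        step = (hi - lo) / (n_frames - 1)
        # Uniformly sample timestamps across the current window.
        times = [lo + i * step for i in range(n_frames)]
        # In the real pipeline these frames are tiled into a single image
        # with index labels and the VLM is asked which indices correspond
        # to the start and end of the queried action.
        s_idx, e_idx = query_vlm(times)
        start, end = times[s_idx], times[e_idx]
        # Narrow the window around the chosen frames, keeping half a
        # sampling step of margin so the true boundary is not cut off.
        lo = max(0.0, start - step / 2)
        hi = min(video_len, end + step / 2)
    return start, end

# Mock "VLM" for demonstration: picks the sampled frames nearest to a
# hypothetical ground-truth interval of 12 s to 47 s.
def mock_vlm(times: List[float]) -> Tuple[int, int]:
    s = min(range(len(times)), key=lambda i: abs(times[i] - 12.0))
    e = min(range(len(times)), key=lambda i: abs(times[i] - 47.0))
    return s, e

start, end = localize_action(120.0, mock_vlm)
```

With each iteration the sampling step shrinks, so the returned boundaries land within one final-iteration step of the true interval, mirroring how the paper's coarse-to-fine sampling trades a fixed per-query frame budget for progressively finer temporal resolution.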