🤖 AI Summary
This paper proposes BaGLM, a training-free framework for online Video Step Grounding (VSG): identifying, in real time, which procedural steps are executed in a video without labeled data or offline access to the full video. Methodologically, BaGLM leverages the zero-shot multimodal understanding of Large Multimodal Models (LMMs) to classify restricted frame sequences into steps, and integrates a Bayesian filtering mechanism that recursively fuses temporal evidence from past frames with step-transition priors, enabling robust progress estimation and dynamic step inference. Its core contribution is the integration of Bayesian state estimation with LMM-based zero-shot perception, eliminating both training dependencies and the need for global video access. Evaluated on three standard benchmarks, BaGLM outperforms existing supervised offline methods, achieving state-of-the-art accuracy in online VSG.
📝 Abstract
Given a task and a set of steps composing it, Video Step Grounding (VSG) aims to detect which steps are performed in a video. Standard approaches for this task require a labeled training set (e.g., with step-level annotations or narrations), which may be costly to collect. Moreover, they process the full video offline, limiting their applicability in scenarios requiring online decisions. Thus, in this work, we explore how to perform VSG online and without training. We achieve this by exploiting the zero-shot capabilities of recent Large Multimodal Models (LMMs). In particular, we use LMMs to predict the step associated with a restricted set of frames, without access to the whole video. We show that this online strategy without task-specific tuning outperforms offline and training-based models. Motivated by this finding, we develop Bayesian Grounding with Large Multimodal Models (BaGLM), further injecting knowledge of past frames into the LMM-based predictions. BaGLM exploits Bayesian filtering principles, modeling step transitions via (i) a dependency matrix extracted through large language models and (ii) an estimation of step progress. Experiments on three datasets show superior performance of BaGLM over state-of-the-art training-based offline methods.
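To make the Bayesian filtering idea concrete, here is a minimal sketch of a recursive predict-update cycle over a discrete set of steps. It assumes per-window step likelihoods (e.g., scores an LMM assigns to each candidate step from the current frames) and a row-stochastic transition matrix encoding step-dependency priors; all names and the example numbers are illustrative, not the paper's actual implementation.

```python
import numpy as np

def bayes_filter_step(belief, likelihood, transition):
    """One recursive Bayesian filtering update over discrete steps.

    belief:     (K,) posterior over K steps from the previous time window
    likelihood: (K,) per-step evidence for the current window (e.g., LMM scores)
    transition: (K, K) row-stochastic step-transition prior, T[i, j] = P(j | i)
    """
    # Predict: propagate the previous belief through the transition prior.
    predicted = transition.T @ belief
    # Update: fuse the prediction with the current observation likelihood.
    posterior = predicted * likelihood
    # Normalize so the belief remains a probability distribution.
    return posterior / posterior.sum()

# Toy example with K = 3 sequential steps.
belief = np.array([1.0, 0.0, 0.0])             # certain we start in step 0
transition = np.array([[0.7, 0.3, 0.0],        # mostly stay, sometimes advance
                       [0.0, 0.7, 0.3],
                       [0.0, 0.0, 1.0]])
likelihood = np.array([0.2, 0.6, 0.2])         # current frames favor step 1
belief = bayes_filter_step(belief, likelihood, transition)
```

Each incoming window only requires the previous belief vector, which is what makes the strategy online: no access to future frames or the full video is needed.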