Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding

📅 2025-08-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Online Video Temporal Grounding (OnVTG) requires models to localize the event segments that match a text query in streaming video under a strict causality constraint: only past frames are accessible, and future frames remain unseen. Existing approaches model events weakly and cannot retain long-term history. To address these limitations, we propose three components: an event-proposal-driven hierarchical memory that explicitly captures events of varying durations; a future-prediction branch that lets the model respond proactively to upcoming events; and joint optimization of event-level features with a regression network for precise start-time prediction. Our method achieves state-of-the-art performance on TACoS, ActivityNet Captions, and MAD, improving localization accuracy while reducing latency.

📝 Abstract

In this paper, we tackle the task of online video temporal grounding (OnVTG), which requires the model to locate events related to a given text query within a video stream. Unlike regular video temporal grounding, OnVTG requires the model to make predictions without observing future frames. As online videos are streaming inputs and can go on indefinitely, it is impractical and inefficient to store all historical inputs. Existing OnVTG models employ memory to store recent historical video frame features and predict scores indicating whether the current frame corresponds to the start or end time of the target event. However, these methods lack effective event modeling and cannot retain long-term historical information, leading to low performance. To tackle these challenges, we propose a hierarchical event memory for OnVTG. Our event-based OnVTG framework makes predictions based on event proposals that model event-level information with various durations. To preserve valuable historical event information, we introduce a hierarchical event memory that retains historical events, allowing the model to access both recent and long-term information. To enable real-time prediction, we further propose a future prediction branch that predicts whether the target event will occur shortly and regresses the start time of the event. We achieve state-of-the-art performance on the TACoS, ActivityNet Captions, and MAD datasets. Code is available at https://github.com/minghangz/OnVTG.
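
As a rough illustration of why a hierarchical memory can cover a long history with bounded storage, the sketch below keeps a fixed number of segments per level and merges the two oldest segments into a coarser one when a level overflows. This is not the authors' implementation: the class name `HierarchicalMemory`, the `num_levels` and `capacity` parameters, and the averaging merge rule are all assumptions made for illustration, and the released code at the URL above should be consulted for the actual design.

```python
from collections import deque


class HierarchicalMemory:
    """Toy bounded memory (hypothetical, not the paper's method).

    Level 0 holds the most recent fine-grained segments. When a level
    overflows, its two oldest segments are merged into one coarser
    segment and pushed to the next level, so older history is kept at
    lower temporal resolution.
    """

    def __init__(self, num_levels=3, capacity=4):
        self.capacity = capacity
        self.levels = [deque() for _ in range(num_levels)]

    def add(self, start, end, feature):
        # Store each segment as (start_time, end_time, feature_vector).
        self._push(0, (start, end, list(feature)))

    def _push(self, level, item):
        buf = self.levels[level]
        buf.append(item)
        if len(buf) <= self.capacity:
            return
        if level + 1 == len(self.levels):
            # Coarsest level is full: drop the oldest segment.
            buf.popleft()
            return
        # Merge the two oldest segments: union of their time spans,
        # element-wise average of their features (an assumed rule).
        a = buf.popleft()
        b = buf.popleft()
        merged = (a[0], b[1], [(x + y) / 2 for x, y in zip(a[2], b[2])])
        self._push(level + 1, merged)

    def size(self):
        # Total number of stored segments across all levels.
        return sum(len(buf) for buf in self.levels)

    def span(self):
        # Earliest start and latest end still covered by the memory.
        items = [it for buf in self.levels for it in buf]
        if not items:
            return None
        return (min(it[0] for it in items), max(it[1] for it in items))
```

With this layout the number of stored segments never exceeds `num_levels * capacity`, regardless of how long the stream runs, while the covered time span grows roughly exponentially with the number of levels.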

Problem

Research questions and friction points this paper is trying to address.

Locate text-related events in streaming videos without future frames

Improve event modeling and retain long-term historical video information

Enable real-time prediction of event start times in online videos

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical event memory for long-term information retention

Event-based framework for modeling event-level durations

Future prediction branch for real-time event forecasting

🔎 Similar Papers

Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models