ArrowGEV: Grounding Events in Video via Learning the Arrow of Time

📅 2026-01-10
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses a critical limitation of existing video event localization methods: they neglect the temporal directionality inherent in events, which weakens temporal structure understanding, robustness, and generalization. To overcome this, we propose ArrowGEV, a novel framework that introduces the physics-inspired concept of the "arrow of time" into video event localization for the first time. Leveraging reinforcement learning, ArrowGEV explicitly models the temporal sensitivity of events through a bidirectional training mechanism that imposes discriminative constraints on time-sensitive events (telling forward from reversed video) and consistency constraints on time-insensitive ones (grounding them identically in both directions). Integrated with a vision-language model, our approach employs temporally sensitive classification and forward–backward contrastive learning, significantly improving event localization accuracy, temporal direction recognition, and overall performance on general video understanding and reasoning tasks.
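The summary does not include code, but the bidirectional mechanism can be pictured as a two-term objective. The sketch below is a minimal PyTorch illustration, assuming per-event embeddings and direction logits from the forward and reversed clips plus a boolean time-sensitivity mask; the function name, tensor layout, and the specific loss forms (cross-entropy for direction, cosine distance for consistency) are our assumptions, not ArrowGEV's actual objective.

```python
import torch
import torch.nn.functional as F

def bidirectional_loss(fwd_emb, bwd_emb, fwd_dir_logits, bwd_dir_logits,
                       time_sensitive):
    """Two-term sketch of a bidirectional training objective.

    fwd_emb, bwd_emb:               event embeddings per view, shape (B, D)
    fwd_dir_logits, bwd_dir_logits: forward-vs-backward logits, shape (B, 2)
    time_sensitive:                 bool mask, shape (B,); True if reversal
                                    changes the event's meaning
    """
    B = fwd_emb.size(0)
    mask = time_sensitive.float()  # 1 = time-sensitive event

    # Discriminative term (time-sensitive events): classify the direction
    # of each view; label 0 = forward clip, label 1 = reversed clip.
    logits = torch.cat([fwd_dir_logits, bwd_dir_logits], dim=0)      # (2B, 2)
    labels = torch.cat([torch.zeros(B), torch.ones(B)]).long().to(logits.device)
    cls = F.cross_entropy(logits, labels, reduction="none")          # (2B,)
    pair_mask = mask.repeat(2)
    discriminative = (cls * pair_mask).sum() / pair_mask.sum().clamp(min=1.0)

    # Consistency term (time-insensitive events): grounding features should
    # be invariant to reversal, so pull the two views' embeddings together.
    dist = 1.0 - F.cosine_similarity(fwd_emb, bwd_emb, dim=-1)       # (B,)
    inv = 1.0 - mask
    consistent = (dist * inv).sum() / inv.sum().clamp(min=1.0)

    return discriminative + consistent
```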

📝 Abstract
Grounding events in videos serves as a fundamental capability in video analysis. While Vision-Language Models (VLMs) are increasingly employed for this task, existing approaches predominantly train models to associate events with timestamps in the forward video only. This paradigm hinders VLMs from capturing the inherent temporal structure and directionality of events, thereby limiting robustness and generalization. To address this limitation, inspired by the arrow of time in physics, which characterizes the intrinsic directionality of temporal processes, we propose ArrowGEV, a reinforcement learning framework that explicitly models temporal directionality in events to improve both event grounding and temporal directionality understanding in VLMs. Specifically, we categorize events into time-sensitive (e.g., putting down a bag) and time-insensitive (e.g., holding a towel in the left hand). The former denote events whose reversal substantially alters their meaning, while the latter remain semantically unchanged under reversal. For time-sensitive events, ArrowGEV introduces a reward that encourages VLMs to discriminate between forward and backward videos, whereas for time-insensitive events, it enforces consistent grounding across both directions. Extensive experiments demonstrate that ArrowGEV not only improves grounding precision and temporal directionality recognition, but also enhances general video understanding and reasoning ability.
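Since ArrowGEV is a reinforcement learning framework, the two constraints plausibly enter as reward terms on sampled rollouts. As a rough illustration only, the sketch below scores one forward/reversed rollout pair with a temporal-IoU grounding term, plus either a direction-classification bonus (time-sensitive events) or a mirrored-span consistency bonus (time-insensitive events). All function names, prediction fields, and weights are hypothetical; the paper's exact reward may differ.

```python
def mirror_span(span, duration):
    """Map a (start, end) span to its position in the reversed video."""
    s, e = span
    return (duration - e, duration - s)

def temporal_iou(a, b):
    """IoU of two (start, end) spans in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def arrowgev_style_reward(pred_fwd, pred_bwd, gt_span, duration,
                          time_sensitive, w_dir=0.5):
    """Score one rollout pair (forward clip + its reversed copy)."""
    # Base grounding reward: how well the forward prediction localizes.
    reward = temporal_iou(pred_fwd["span"], gt_span)

    if time_sensitive:
        # Discriminative reward: credit for labeling each view's direction.
        reward += w_dir * (pred_fwd["direction"] == "forward")
        reward += w_dir * (pred_bwd["direction"] == "backward")
    else:
        # Consistency reward: the reversed clip should localize the same
        # event, i.e., its span should match the mirrored ground truth.
        reward += temporal_iou(pred_bwd["span"],
                               mirror_span(gt_span, duration))
    return reward
```

Note the mirroring step: a span (s, e) in a video of length T appears at (T − e, T − s) in the reversed clip, which is what makes the cross-direction consistency check well-defined.
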
Problem

Research questions and friction points this paper is trying to address.

Event Grounding
Temporal Directionality
Video Understanding
Vision-Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Arrow of Time
Event Grounding
Vision-Language Models
Temporal Directionality
Reinforcement Learning