What Happens When: Learning Temporal Orders of Events in Videos

📅 2025-12-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video large multimodal models (VLMMs) achieve strong performance on standard benchmarks but empirically exhibit severe deficiencies in reasoning about temporal order among multiple events—often relying on scene-level priors rather than genuine temporal inference. To address this, we introduce VECTOR, the first benchmark explicitly designed for evaluating multi-event temporal ordering. We further propose MECOT, a temporal modeling framework that synergistically combines multi-event fine-grained instruction tuning with chain-of-thought (CoT) prompting. MECOT explicitly enhances temporal awareness through event-level description generation, temporally sensitive prompt engineering, and structured reasoning. On VECTOR, MECOT significantly outperforms all baselines. Moreover, it yields consistent performance gains across mainstream video understanding tasks—including action recognition and temporal localization—demonstrating its generalizability. To foster reproducibility and community advancement, we publicly release our code, models, and the VECTOR dataset.

📝 Abstract
Video Large Multimodal Models (VLMMs) have shown impressive performance in video understanding, yet their ability to accurately capture the temporal order of multiple events remains underexplored. Through comprehensive experiments, we make the interesting observation that models still perform very well on existing benchmarks even when video frames are scrambled. This implies that VLMMs may not rely on accurate sequential processing of visual events, but instead depend on prior knowledge of typical scenarios to answer the question. To benchmark temporal understanding in VLMMs, we propose VECTOR, designed to explicitly assess a model's ability to identify the temporal order of events. On this benchmark, we observe that various VLMMs often fail to understand the order of events. To address this, we propose MECOT (Multi-Event instruction fine-tuning with Chain-of-Thought), which (1) trains models on detailed, event-by-event video descriptions and (2) uses chain-of-thought prompts at inference to enhance temporal awareness. MECOT outperforms prior art on VECTOR and also improves performance on existing video benchmarks, demonstrating the effectiveness of its temporal understanding. We release our code, model, and datasets.
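The scrambled-frame probe described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the frame list and any downstream model call are hypothetical stand-ins for a real VLMM benchmark pipeline.

```python
import random

def scramble_frames(frames, seed=0):
    """Return a copy of a frame sequence in a randomized order.

    The probe: if a model's benchmark score barely changes when its
    input frames are scrambled, the model is likely answering from
    scene-level priors rather than genuine temporal reasoning.
    A fixed seed keeps the scrambling reproducible across runs.
    """
    rng = random.Random(seed)
    shuffled = list(frames)  # copy so the original order is preserved
    rng.shuffle(shuffled)
    return shuffled

# Hypothetical usage: evaluate a model on both orders and compare scores.
frames = ["frame_0", "frame_1", "frame_2", "frame_3", "frame_4"]
scrambled = scramble_frames(frames)
# Same frames are present; only their temporal order differs.
assert sorted(scrambled) == sorted(frames)
```

In the actual study one would run the benchmark twice, once per ordering, and compare accuracies; a near-zero gap is the red flag the paper reports.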
Problem

Research questions and friction points this paper is trying to address.

Assessing VLMMs' ability to understand temporal event orders in videos
Addressing models' failure to capture correct sequences of multiple events
Enhancing temporal awareness through fine-tuning and chain-of-thought prompting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposed VECTOR benchmark for temporal order assessment
Introduced MECOT with event-by-event fine-tuning
Used chain-of-thought prompting to enhance temporal awareness
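As a rough illustration of the chain-of-thought prompting idea above, a temporal-ordering prompt might be assembled as below. The template wording is an assumption for illustration only; the paper's actual MECOT prompts are not reproduced here.

```python
def temporal_cot_prompt(question, events):
    """Build a chain-of-thought style prompt that asks a model to first
    describe each event, then reason about their temporal order before
    answering. Illustrative template, not MECOT's actual prompt.
    """
    steps = "\n".join(f"{i + 1}. Describe the event: {e}"
                      for i, e in enumerate(events))
    return (
        "Watch the video and reason step by step.\n"
        f"{steps}\n"
        "Now determine the temporal order of these events, "
        "then answer the question.\n"
        f"Question: {question}"
    )

prompt = temporal_cot_prompt(
    "Which event happened first?",
    ["a person picks up keys", "a person opens the door"],
)
```

The design intent is to force the model to externalize per-event descriptions before committing to an ordering, rather than answering directly from scene-level priors.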