Deep Temporal Reasoning in Video Language Models: A Cross-Linguistic Evaluation of Action Duration and Completion through Perfect Times

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether video-language models (VLMs) genuinely comprehend the temporal dynamics of actions, or merely exploit superficial visual cues to distinguish perfective (telic) from imperfective (durative) events. Method: We introduce Perfect Times, a quadrilingual (English, Italian, Russian, Japanese) benchmark that pairs grammatical aspectual features with real-world videos; it comprises event-telicity annotations, multilingual video question answering, and vision–language temporal-alignment evaluation. Crucially, the work applies linguistic aspect theory to the assessment of VLM temporal reasoning. Results: State-of-the-art VLMs achieve markedly lower average accuracy than humans across all four languages, revealing fundamental deficiencies in modeling the causal structure and temporal boundaries of actions. The study systematically exposes VLMs' core limitations in cross-lingual vision–language temporal reasoning and provides a scalable, theoretically grounded evaluation framework for future research.
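The per-language accuracy comparison described above can be sketched as follows. This is an illustrative helper, not the paper's evaluation code: the item fields (`language`, `answer_idx`) and the `predict` callable are assumptions chosen to mirror a multiple-choice VQA setup.

```python
from collections import defaultdict

def per_language_accuracy(items, predict):
    """Aggregate multiple-choice accuracy separately for each language.

    `items` is an iterable of dicts with hypothetical fields 'language'
    and 'answer_idx' (index of the correct choice); `predict` maps an
    item to the model's chosen choice index.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        lang = item["language"]
        total[lang] += 1
        if predict(item) == item["answer_idx"]:
            correct[lang] += 1
    # Per-language accuracy: fraction of items answered correctly.
    return {lang: correct[lang] / total[lang] for lang in total}
```

A per-language breakdown like this is what surfaces the gap the summary reports: a model can look adequate on pooled accuracy while failing uniformly on aspect-sensitive items in every language.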

📝 Abstract
Human perception of events is intrinsically tied to distinguishing between completed (perfect and telic) and ongoing (durative) actions, a process mediated by both linguistic structure and visual cues. In this work, we introduce the Perfect Times dataset, a novel, quadrilingual (English, Italian, Russian, and Japanese) multiple-choice question-answering benchmark designed to assess video-language models (VLMs) on temporal reasoning. By pairing everyday activity videos with event completion labels and perfectivity-tailored distractors, our dataset probes whether models truly comprehend temporal dynamics or merely latch onto superficial markers. Experimental results indicate that state-of-the-art models, despite their success on text-based tasks, struggle to mirror human-like temporal and causal reasoning grounded in video. This study underscores the necessity of integrating deep multimodal cues to capture the nuances of action duration and completion within temporal and causal video dynamics, setting a new standard for evaluating and advancing temporal reasoning in VLMs.
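A benchmark item of the shape the abstract describes (a video, a language, a completion label, and a correct answer alongside perfectivity-tailored distractors) might be represented as below. All field names are hypothetical, chosen only to mirror that description; the actual Perfect Times data format may differ.

```python
from dataclasses import dataclass

@dataclass
class PerfectTimesItem:
    """Hypothetical schema for one multiple-choice QA item."""
    video_id: str
    language: str      # one of "en", "it", "ru", "ja"
    question: str
    choices: list      # correct answer plus perfectivity-tailored distractors
    answer_idx: int    # index of the correct choice
    completed: bool    # event completion label: telic/perfective vs. durative

    def is_correct(self, predicted_idx: int) -> bool:
        """Score a model's chosen choice index against the gold answer."""
        return predicted_idx == self.answer_idx
```

Keeping the completion label separate from the choice index lets an evaluator break accuracy down by event type, i.e., whether models fail more on completed or on ongoing actions.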
Problem

Research questions and friction points this paper is trying to address.

Assessing VLMs' temporal reasoning in multilingual video contexts
Evaluating action duration and completion understanding in VLMs
Challenging VLMs to distinguish completed vs. ongoing actions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces quadrilingual Perfect Times dataset
Probes temporal reasoning in video-language models
Integrates multimodal cues for action dynamics