🤖 AI Summary
Can multimodal large language models (MLLMs) accurately comprehend temporal relationships among events in image sequences? This paper introduces TempVS, a benchmark for fine-grained temporal grounding and reasoning, comprising three tasks—event relation inference, sentence ordering, and image ordering—that jointly demand temporal grounding and cross-modal reasoning over visual and linguistic inputs. TempVS pairs each main test with a basic grounding test, establishing a hierarchical evaluation: grounding → relation → ordering. A systematic assessment of 38 state-of-the-art MLLMs reveals pervasive and substantial deficits in temporal understanding, with performance markedly lagging behind human capabilities. The authors provide fine-grained failure analysis and publicly release the full dataset, code, and evaluation framework to support research on temporal cognition in MLLMs.
📝 Abstract
This paper introduces the TempVS benchmark, which focuses on the temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (i.e., event relation inference, sentence ordering, and image ordering), each accompanied by a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS, with a substantial performance gap compared to human capabilities. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at https://github.com/yjsong22/TempVS.