Seeing the Arrow of Time in Large Multimodal Models

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large multimodal models (LMMs) lack awareness of the temporal irreversibility of video, the “arrow of time”, which hinders deeper temporal reasoning. Method: We propose ArrowRL, a reinforcement learning (RL)-based training strategy for time-direction awareness. Its reverse reward encourages the model's interpretation of a video to diverge between the forward and the reversed frame sequence, explicitly guiding the model to attend to temporal flow. Contribution/Results: We also introduce AoTBench, a multi-faceted benchmark dedicated to time-direction reasoning. Experiments show that ArrowRL substantially improves performance on AoTBench and also helps on standard video question answering (VQA) benchmarks, with peak accuracy gains of over 20% and 10% respectively, indicating that dedicated modeling of the arrow of time strengthens LMM temporal understanding.

📝 Abstract
The Arrow of Time (AoT), time's irreversible flow shaping physical events, is fundamental to video comprehension, yet remains a significant challenge for modern large multimodal models (LMMs). Current LMMs struggle to perceive and utilize temporal directionality in video when responding to language queries, obstructing deeper temporal understanding. We tackle this deficiency by first providing a critical analysis of existing benchmarks and models. We then introduce ArrowRL, a reinforcement learning (RL)-based training strategy with an innovative reverse reward that instills AoT awareness by encouraging divergent video interpretations between forward and reversed visual frames. For rigorous evaluation, we additionally develop AoTBench, a new multi-faceted benchmark probing temporally challenging questions. Experiments show ArrowRL greatly advances temporal perception: it not only achieves substantial improvements on our challenging AoTBench but also demonstrably boosts performance on standard video question answering (VQA) benchmarks (with peak accuracy gains reaching over 20% and 10%, respectively). This validates ArrowRL's effectiveness and highlights the critical need for dedicated AoT understanding in LMMs.
Problem

Research questions and friction points this paper is trying to address.

LMMs struggle with temporal directionality in video comprehension
Current models lack awareness of the Arrow of Time (AoT)
Need for dedicated AoT understanding in multimodal models
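
One way to make the problem above concrete is to check how often a model's answer changes when a clip is played backward. The sketch below is illustrative only and is not taken from the paper: `direction_sensitivity`, `answer_fn`, and the two toy models are hypothetical names, and frames are represented as plain strings rather than images.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in: a "frame" is just a string label here, not pixel data.
Frames = Sequence[str]
AnswerFn = Callable[[Frames, str], str]


def direction_sensitivity(clips: List[Frames], question: str, answer_fn: AnswerFn) -> float:
    """Fraction of clips whose answer changes when the frame order is reversed."""
    if not clips:
        return 0.0
    changed = 0
    for frames in clips:
        forward_answer = answer_fn(frames, question)
        reversed_answer = answer_fn(list(reversed(frames)), question)
        if forward_answer != reversed_answer:
            changed += 1
    return changed / len(clips)


def order_blind_model(frames: Frames, question: str) -> str:
    """Toy model that ignores frame order (sorts the frames before describing them)."""
    return "sees " + ", ".join(sorted(frames))


def order_aware_model(frames: Frames, question: str) -> str:
    """Toy model whose answer depends on which frame comes first and which comes last."""
    return f"{frames[0]} then {frames[-1]}"


clips = [["door_closed", "door_open"], ["cup_full", "cup_empty"]]
blind_score = direction_sensitivity(clips, "What happens?", order_blind_model)
aware_score = direction_sensitivity(clips, "What happens?", order_aware_model)
```

A score near zero on clips whose meaning depends on direction would suggest that the model is not using temporal order at all.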
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning with reverse reward
ArrowRL for temporal awareness
AoTBench for rigorous evaluation
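
The reverse reward is described here only at a high level: it encourages the model's interpretation of a video to diverge between forward and reversed frames. The sketch below shows one possible shape for such a signal, assuming a simple word-overlap (Jaccard) dissimilarity between the two responses. That measure, and the names `reverse_reward` and `jaccard_similarity`, are assumptions made for illustration; this summary does not state how ArrowRL actually measures divergence.

```python
def tokenize(text: str) -> set:
    """Lowercase whitespace tokenization into a set of words."""
    return set(text.lower().split())


def jaccard_similarity(a: str, b: str) -> float:
    """Word-set overlap in [0, 1]; two empty strings count as identical."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def reverse_reward(forward_response: str, reversed_response: str) -> float:
    """Assumed toy reward: higher when the two responses share fewer words.

    forward_response  -- model output for the clip in its original order
    reversed_response -- model output for the same clip with frames reversed
    """
    return 1.0 - jaccard_similarity(forward_response, reversed_response)


same = reverse_reward("the door opens", "the door opens")
partial = reverse_reward("the door opens", "the door closes")
disjoint = reverse_reward("the door opens", "a cup falls")
```

In an RL training loop, a term like this would typically be combined with a task reward for answer correctness, so that the model is not rewarded for diverging arbitrarily. The abstract does not specify how ArrowRL balances the two.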