Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs

📅 2026-05-08
🤖 AI Summary
This study addresses the large performance gap between video large language models (Video-LLMs) and humans in judging whether a video plays forward or backward, a gap whose cause has remained unclear. By tracing how temporal information propagates across the visual encoder, the projector, and the large language model, the work identifies projector design as a critical bottleneck: a Q-Former disrupts temporal information, whereas a time-preserved MLP projection substantially improves the LLM's access to it. Guided by this finding, the authors combine a temporally aware video-centric encoder, a time-preserved MLP projector, and Arrow-of-Time supervision. The resulting model achieves 98.1% accuracy on AoT$_{PPB}$, surpassing human performance, and improves broader temporal reasoning by up to 6.0 points on VITATECS-Direction and 1.3 points on TVBench.
📝 Abstract
The Arrow-of-Time (AoT) task, determining whether a video plays forward or backward by recognizing temporal irreversibility, is one that humans solve with near-perfect accuracy, yet frontier Video Large Language Models (Video-LLMs) perform only modestly above chance. This gap raises a key question: do visual backbones fail to encode temporal information, or does the information bottleneck lie elsewhere in the Video-LLM architecture? We address this question by isolating the vision encoder from the Video-LLM and tracing temporal information across the encoder, projector, and LLM. We find that video-centric encoders with explicit temporal modeling encode strong temporal signals, whereas frame-centric encoders do not. However, when video-centric representations are passed through a standard Video-LLM architecture, performance often collapses, revealing a bottleneck in temporal information flow. We identify projector design as a key factor: a Q-Former disrupts temporal information, while a time-preserved MLP projection substantially improves the LLM's access to such information. Our layer-wise analysis further shows how temporal representations evolve across encoder layers. Guided by these findings, we build a Video-LLM with a temporally aware video-centric encoder, a time-preserved projector, and AoT supervision, surpassing human performance on AoT$_{PPB}$ with 98.1% accuracy and improving broader temporal reasoning tasks by up to 6.0 points on VITATECS-Direction and 1.3 points on TVBench. Our results show that temporal reasoning in Video-LLMs requires both effective temporal encoding and reliable transfer of this information to the LLM.
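The abstract's point about projector design can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the tiny feature vectors, and the use of temporal mean pooling as a stand-in for an order-insensitive projector are all assumptions made for illustration. It shows that a per-frame projection which keeps frame order gives different outputs for a forward clip and its reversal, while an order-insensitive aggregation maps both to the same output, so nothing downstream could tell them apart.

```python
from typing import List, Tuple

Frame = List[float]   # toy stand-in for one frame's encoder features
Clip = List[Frame]    # frames in temporal order


def project_frame(frame: Frame, weights: List[List[float]]) -> Frame:
    """One linear layer followed by ReLU, applied to a single frame."""
    return [max(0.0, sum(w * x for w, x in zip(row, frame))) for row in weights]


def time_preserving_projection(clip: Clip, weights: List[List[float]]) -> Clip:
    """Project each frame independently and keep the original frame order."""
    return [project_frame(frame, weights) for frame in clip]


def order_insensitive_projection(clip: Clip, weights: List[List[float]]) -> Frame:
    """Hypothetical baseline: average projected frames over time (order is lost)."""
    projected = time_preserving_projection(clip, weights)
    return [sum(column) / len(projected) for column in zip(*projected)]


def make_aot_pair(clip: Clip) -> List[Tuple[Clip, str]]:
    """Build a forward/backward pair of labelled examples from one clip."""
    return [(clip, "forward"), (clip[::-1], "backward")]


# Toy clip: the first feature grows over time, the second stays constant.
clip = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # illustrative weights, not learned

(forward, _), (backward, _) = make_aot_pair(clip)

# Frame order survives the per-frame projection, so the two clips differ.
keeps_order = (
    time_preserving_projection(forward, identity)
    != time_preserving_projection(backward, identity)
)
# Averaging over time collapses both clips to the same vector.
loses_order = (
    order_insensitive_projection(forward, identity)
    == order_insensitive_projection(backward, identity)
)
print(keeps_order, loses_order)
```

In this sketch the choice of pooling is only a convenient extreme case. A real Q-Former is not a mean pool, and the paper's claim about it rests on the authors' experiments rather than on this argument.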
Problem

Research questions and friction points this paper is trying to address.

Arrow-of-Time
Video-LLMs
temporal information flow
vision encoder
temporal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Information Flow
Video-LLMs
Arrow of Time
Time-Preserved Projection
Vision Encoder
Peitao Han
The University of Osaka, Center for Information and Neural Networks, National Institute of Information and Communications Technology
Fei Cheng
Kyoto University
Lis K. Pereira
The University of Osaka, Center for Information and Neural Networks, National Institute of Information and Communications Technology
Qianying Liu
Researcher, National Institute of Informatics
natural language processing, artificial intelligence, machine learning
Shigeru Kitazawa
The University of Osaka, Center for Information and Neural Networks, National Institute of Information and Communications Technology