Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This study addresses the large gap between video large language models (Video-LLMs) and humans in judging whether a video plays forward or backward, a limitation whose cause has remained unclear. By tracing how temporal information propagates through the visual encoder, the projector, and the large language model, the work identifies projector design as a key bottleneck: a Q-Former disrupts temporal information, whereas a time-preserved MLP projection substantially improves the LLM's access to it. Guided by this finding, the authors combine a temporally aware video-centric encoder, a time-preserved MLP projector, and Arrow-of-Time (AoT) supervision to strengthen temporal reasoning. The resulting model reaches 98.1% accuracy on AoT$_{PPB}$, surpassing human performance, and improves by up to 6.0 points on VITATECS-Direction and 1.3 points on TVBench.

📝 Abstract

The Arrow-of-Time (AoT) task, determining whether a video plays forward or backward by recognizing temporal irreversibility, is one that humans solve with near-perfect accuracy, yet frontier Video Large Language Models (Video-LLMs) perform only modestly above chance. This gap raises a key question: do visual backbones fail to encode temporal information, or does the information bottleneck lie elsewhere in the Video-LLM architecture? We address this question by isolating the vision encoder from the Video-LLM and tracing temporal information across the encoder, projector, and LLM. We find that video-centric encoders with explicit temporal modeling encode strong temporal signals, whereas frame-centric encoders do not. However, when video-centric representations are passed through a standard Video-LLM architecture, performance often collapses, revealing a bottleneck in temporal information flow. We identify projector design as a key factor: a Q-Former disrupts temporal information, while a time-preserved MLP projection substantially improves the LLM's access to such information. Our layer-wise analysis further shows how temporal representations evolve across encoder layers. Guided by these findings, we build a Video-LLM with a temporally aware video-centric encoder, a time-preserved projector, and AoT supervision, surpassing human performance on AoT$_{PPB}$ with 98.1% accuracy and improving broader temporal reasoning by up to 6.0 points on VITATECS-Direction and 1.3 points on TVBench. Our results show that temporal reasoning in Video-LLMs requires both effective temporal encoding and reliable transfer of this information to the LLM.
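To make the idea of a projector that preserves or discards temporal order concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: it assumes one feature vector per frame, random weights, and hypothetical helper names (`make_aot_pair`, `mlp_project`, `temporal_mean_pool`). Mean pooling over time is used only as an illustrative order-destroying contrast and is not a model of a Q-Former.

```python
import math
import random


def make_aot_pair(frames):
    """Build the two AoT samples for one clip: label 1 = forward, 0 = reversed."""
    return [(frames, 1), (frames[::-1], 0)]


def linear(x, weights, bias):
    """Dense layer; `weights` holds one row of input coefficients per output unit."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b
            for row, b in zip(weights, bias)]


def mlp_project(tokens, w1, b1, w2, b2):
    """Apply the same two-layer MLP to each frame token independently.

    Output token t depends only on input token t, so frame order is kept.
    """
    projected = []
    for token in tokens:
        hidden = [max(h, 0.0) for h in linear(token, w1, b1)]  # ReLU
        projected.append(linear(hidden, w2, b2))
    return projected


def temporal_mean_pool(tokens):
    """Average over the time axis (illustrative contrast: order is discarded)."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes and random features standing in for encoder outputs.
rng = random.Random(0)
T, d_in, d_hidden, d_out = 8, 4, 6, 5
frames = rand_matrix(rng, T, d_in)
w1, b1 = rand_matrix(rng, d_hidden, d_in), [0.0] * d_hidden
w2, b2 = rand_matrix(rng, d_out, d_hidden), [0.0] * d_out

(forward, y_fwd), (backward, y_bwd) = make_aot_pair(frames)
proj_fwd = mlp_project(forward, w1, b1, w2, b2)
proj_bwd = mlp_project(backward, w1, b1, w2, b2)

# Reversing the clip reverses the projected token sequence: direction survives.
order_preserved = proj_bwd == proj_fwd[::-1]

# Pooling over time gives the same vector for both directions: direction is lost.
pooled_identical = all(
    math.isclose(a, b, abs_tol=1e-12)
    for a, b in zip(temporal_mean_pool(forward), temporal_mean_pool(backward))
)
```

Under these assumptions, the per-token MLP hands the LLM a sequence whose order still distinguishes the forward clip from the reversed one, while the pooled representation is the same for both, so no downstream model could recover the playback direction from it.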

Problem

Research questions and friction points this paper is trying to address.

Arrow-of-Time

Video-LLMs

temporal information flow

vision encoder

temporal reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Information Flow

Video-LLMs

Arrow of Time