Time Blindness: Why Video-Language Models Can't See What Humans Can?

📅 2025-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper identifies a critical “temporal blindness” in state-of-the-art video-language models (VLMs): when spatial cues are occluded (e.g., by pure noise frames), VLMs achieve 0% accuracy on temporal pattern recognition—while humans maintain >98%—revealing an overreliance on frame-level spatial features and a severe deficit in pure temporal reasoning. Method: We introduce SpookyBench, the first open-source benchmark explicitly designed to evaluate *pure temporal perception*, constructed via temporally structured noise-frame encodings of visual patterns. Through human behavioral experiments, multi-model cross-scale ablation, and low-spatial signal-to-noise ratio (SNR) training analysis, we validate the universality of this limitation across architectures and scales, and demonstrate rapid degradation of temporal understanding as spatial SNR decreases. Contribution/Results: We provide the first systematic discovery and quantification of VLMs’ temporal blindness, establish the necessity of spatiotemporal disentanglement, and publicly release SpookyBench to advance research in temporal-aware video understanding.

📝 Abstract

Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce $\textbf{SpookyBench}$, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained on datasets with low spatial signal-to-noise ratios (SNR), temporal understanding of models degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and bridge the gap between human and machine video understanding. Dataset and code have been made available on our project website: https://timeblindness.github.io/.
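The abstract does not spell out how the noise-like frames are built, so the following is only a minimal sketch of one way information can live purely in time. It assumes a scheme in which noise inside a shape scrolls one way and noise outside it scrolls the other way; the function names `make_temporal_noise_video` and `decode_by_motion`, the frame size, and the square mask are all hypothetical and not taken from SpookyBench. Each single frame is independent binary noise, so the shape is recoverable only by comparing frames over time.

```python
import numpy as np


def make_temporal_noise_video(mask, n_frames=16, seed=0):
    """Build binary noise frames where `mask` is visible only through motion.

    Assumed scheme (illustrative, not the benchmark's documented method):
    noise inside the mask scrolls up one pixel per frame, noise outside
    scrolls down one pixel per frame.
    """
    rng = np.random.default_rng(seed)
    h, w = mask.shape
    fg = rng.integers(0, 2, size=(h, w), dtype=np.uint8)  # foreground texture
    bg = rng.integers(0, 2, size=(h, w), dtype=np.uint8)  # background texture
    frames = np.empty((n_frames, h, w), dtype=np.uint8)
    for t in range(n_frames):
        frames[t] = np.where(
            mask,
            np.roll(fg, -t, axis=0),  # content moves up inside the shape
            np.roll(bg, t, axis=0),   # content moves down outside the shape
        )
    return frames


def decode_by_motion(frames):
    """Recover the mask by testing which scroll direction explains each pixel."""
    prev, nxt = frames[:-1], frames[1:]
    # Fraction of frame pairs consistent with upward vs. downward motion.
    up = (nxt == np.roll(prev, -1, axis=1)).mean(axis=0)
    down = (nxt == np.roll(prev, 1, axis=1)).mean(axis=0)
    return up > down


# Hypothetical example: a 24x24 square hidden in 64x64 noise.
mask = np.zeros((64, 64), dtype=bool)
mask[20:44, 20:44] = True

frames = make_temporal_noise_video(mask)
decoded = decode_by_motion(frames)

# A single frame carries no usable spatial cue: mean intensity is about the
# same inside and outside the shape.
single_frame_gap = abs(frames[0][mask].mean() - frames[0][~mask].mean())
temporal_accuracy = (decoded == mask).mean()
print(f"single-frame intensity gap: {single_frame_gap:.3f}")
print(f"mask recovered from motion: {temporal_accuracy:.3f}")
```

In this toy setting a simple frame-to-frame comparison recovers almost the whole shape, with errors confined to the rows where the two motion fields meet. A model that only inspects frames one at a time has nothing to work with, which is the kind of failure the abstract attributes to frame-level spatial features.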

Problem

Research questions and friction points this paper is trying to address.

VLMs fail to recognize temporal patterns in noise-like frames

Humans recognize the encoded content with over 98% accuracy, while state-of-the-art VLMs score 0%

Current VLMs over-rely on spatial features, lacking temporal reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing SpookyBench, a benchmark that encodes information solely in temporal sequences of noise-like frames

Arguing that new architectures or training paradigms must decouple spatial dependencies from temporal processing

Releasing the dataset and code to catalyze research in temporal pattern recognition

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs