Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time

📅 2024-09-20
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit fragility and inconsistency in recalling time-sensitive facts. To address this, we introduce TimeShift—the first daily-granularity temporal awareness evaluation framework—accompanied by a benchmark dataset spanning 2018–2024 and comprising over 8,000 fine-grained temporal events across politics, science, and business. Our methodology includes temporally annotated data construction, time-sensitive question-answering design, cross-model comparative evaluation protocols, and robustness testing under factual paraphrasing. Key findings reveal that base models often outperform instruction-tuned variants; answer accuracy degrades significantly with finer temporal granularity; and temporal consistency collapses markedly under paraphrased factual queries. This work provides the first quantitative characterization of temporal awareness deficiencies in mainstream LLMs, delivering a reproducible benchmark and diagnostic toolkit to advance temporally adaptive modeling.
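The time-sensitive question-answering design described above can be sketched as a minimal evaluation loop. This is an illustrative assumption, not the paper's actual implementation: the `TemporalEvent` schema, the prompt format, and the `model_answer` callable are all hypothetical stand-ins for the benchmark's real data format and model interface.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical record for a day-granularity temporal event
# (schema assumed for illustration, not taken from the paper).
@dataclass
class TemporalEvent:
    question: str  # e.g. "Who is the US President?"
    answer: str    # the correct answer at `as_of`
    as_of: date    # day-level reference date
    domain: str    # politics / science / business

def evaluate_time_sensitive_recall(model_answer, events):
    """Score a model by asking each question conditioned on its date.

    `model_answer(prompt, as_of)` is a stand-in for any LLM call that
    returns an answer string given an explicitly dated prompt.
    """
    correct = 0
    for ev in events:
        prompt = f"As of {ev.as_of.isoformat()}: {ev.question}"
        if ev.answer.lower() in model_answer(prompt, ev.as_of).lower():
            correct += 1
    return correct / len(events)

# Toy oracle that already knows the dated answers, to exercise the loop.
events = [
    TemporalEvent("Who is the US President?", "Donald Trump",
                  date(2019, 6, 1), "politics"),
    TemporalEvent("Who is the US President?", "Joe Biden",
                  date(2021, 6, 1), "politics"),
]
oracle = {e.as_of: e.answer for e in events}
accuracy = evaluate_time_sensitive_recall(lambda q, d: oracle[d], events)
print(accuracy)  # 1.0 for the oracle
```

The robustness testing under paraphrasing would then re-run the same loop with reworded versions of each `question` and compare accuracies, which is where the paper reports consistency collapsing.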

📝 Abstract
Who is the US President? The answer changes depending on when the question is asked. While large language models (LLMs) are evaluated on various reasoning tasks, they often miss a crucial dimension: time. In real-world scenarios, the correctness of answers is frequently tied to temporal context. To address this gap, we present a novel framework and dataset spanning over 8,000 events from 2018 to 2024, annotated with day-level granularity and sourced globally across domains such as politics, science, and business. Our TimeShift evaluation method systematically probes LLMs for temporal reasoning, revealing that base models often outperform instruction-tuned and synthetic-trained counterparts on time-sensitive recall. Additionally, we find that even large-scale models exhibit brittleness in handling paraphrased facts, highlighting unresolved challenges in temporal consistency. By identifying these limitations, our work provides a significant step toward advancing time-aware language models capable of adapting to the dynamic nature of real-world knowledge.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' time-sensitive fact recall
Developing time-aware evaluation framework
Identifying temporal reasoning limitations in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

TimeShift evaluation method introduced
Dataset with day-level granularity used
Base models outperform instruction-tuned variants on time-sensitive recall
David Herel
FEE, CTU in Prague; CIIRC, Czech Technical University
Vojtěch Bartek
FEE, CTU in Prague
Tomáš Mikolov
CIIRC, Czech Technical University in Prague