Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time

📅 2024-09-20
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit fragility and inconsistency in recalling time-sensitive facts. To address this, we introduce TimeShift—the first daily-granularity temporal awareness evaluation framework—accompanied by a benchmark dataset spanning 2018–2024 and comprising over 8,000 fine-grained temporal events across politics, science, and business. Our methodology includes temporally annotated data construction, time-sensitive question-answering design, cross-model comparative evaluation protocols, and robustness testing under factual paraphrasing. Key findings reveal that base models often outperform instruction-tuned variants; answer accuracy degrades significantly with finer temporal granularity; and temporal consistency collapses markedly under paraphrased factual queries. This work provides the first quantitative characterization of temporal awareness deficiencies in mainstream LLMs, delivering a reproducible benchmark and diagnostic toolkit to advance temporally adaptive modeling.
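The time-sensitive question-answering design described above can be sketched as a minimal evaluation loop. This is an illustrative assumption, not the paper's actual implementation: the `TemporalEvent` schema, the prompt format, and the `model_answer` callable are all hypothetical stand-ins for the benchmark's real data format and model interface.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical record for a day-granularity temporal event
# (schema assumed for illustration, not taken from the paper).
@dataclass
class TemporalEvent:
    question: str  # e.g. "Who is the US President?"
    answer: str    # the correct answer at `as_of`
    as_of: date    # day-level reference date
    domain: str    # politics / science / business

def evaluate_time_sensitive_recall(model_answer, events):
    """Score a model by asking each question conditioned on its date.

    `model_answer(prompt, as_of)` is a stand-in for any LLM call that
    returns an answer string given an explicitly dated prompt.
    """
    correct = 0
    for ev in events:
        prompt = f"As of {ev.as_of.isoformat()}: {ev.question}"
        if ev.answer.lower() in model_answer(prompt, ev.as_of).lower():
            correct += 1
    return correct / len(events)

# Toy oracle that already knows the dated answers, to exercise the loop.
events = [
    TemporalEvent("Who is the US President?", "Donald Trump",
                  date(2019, 6, 1), "politics"),
    TemporalEvent("Who is the US President?", "Joe Biden",
                  date(2021, 6, 1), "politics"),
]
oracle = {e.as_of: e.answer for e in events}
accuracy = evaluate_time_sensitive_recall(lambda q, d: oracle[d], events)
print(accuracy)  # 1.0 for the oracle
```

The robustness testing under paraphrasing would then re-run the same loop with reworded versions of each `question` and compare accuracies, which is where the paper reports consistency collapsing.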

📝 Abstract
Who is the US President? The answer changes depending on when the question is asked. While large language models (LLMs) are evaluated on various reasoning tasks, they often miss a crucial dimension: time. In real-world scenarios, the correctness of answers is frequently tied to temporal context. To address this gap, we present a novel framework and dataset spanning over 8,000 events from 2018 to 2024, annotated with day-level granularity and sourced globally across domains such as politics, science, and business. Our TimeShift evaluation method systematically probes LLMs for temporal reasoning, revealing that base models often outperform instruction-tuned and synthetic-trained counterparts on time-sensitive recall. Additionally, we find that even large-scale models exhibit brittleness in handling paraphrased facts, highlighting unresolved challenges in temporal consistency. By identifying these limitations, our work provides a significant step toward advancing time-aware language models capable of adapting to the dynamic nature of real-world knowledge.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' time-sensitive fact recall
Developing time-aware evaluation framework
Identifying temporal reasoning limitations in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

TimeShift evaluation method introduced
Dataset with day-level granularity used
Base models outperform instruction-tuned variants on time-sensitive recall
David Herel
FEE, CTU in Prague; CIIRC, Czech Technical University
Vojtěch Bartek
FEE, CTU in Prague
Tomáš Mikolov
CIIRC, Czech Technical University in Prague