🤖 AI Summary
Large language models (LLMs) are fragile and inconsistent when recalling time-sensitive facts. To address this, we introduce TimeShift, the first framework for evaluating temporal awareness at day-level granularity, together with a benchmark dataset of over 8,000 fine-grained temporal events spanning 2018–2024 across politics, science, and business. Our methodology covers temporally annotated data construction, time-sensitive question-answering design, cross-model comparative evaluation protocols, and robustness testing under factual paraphrasing. Key findings: base models often outperform their instruction-tuned variants; answer accuracy degrades sharply as temporal granularity becomes finer; and temporal consistency collapses under paraphrased factual queries. This work provides the first quantitative characterization of temporal-awareness deficits in mainstream LLMs, delivering a reproducible benchmark and diagnostic toolkit for building temporally adaptive models.
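To make the "temporally annotated data" and "time-sensitive question-answering" components concrete, here is a minimal sketch of what a day-level event record and a time-conditioned lookup might look like. All names and fields (`TemporalEvent`, `answer_at`, the example events) are illustrative assumptions, not the paper's actual data schema or toolkit.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class TemporalEvent:
    """A day-granularity fact record (fields are hypothetical, for illustration)."""
    subject: str
    attribute: str
    value: str
    start: date            # first day the fact holds
    end: Optional[date]    # None means the fact is still current

def answer_at(events, subject, attribute, when):
    """Return the value that held for (subject, attribute) on day `when`."""
    for e in events:
        if (e.subject == subject and e.attribute == attribute
                and e.start <= when and (e.end is None or when < e.end)):
            return e.value
    return None  # no fact covers that day

events = [
    TemporalEvent("United States", "president", "Donald Trump",
                  date(2017, 1, 20), date(2021, 1, 20)),
    TemporalEvent("United States", "president", "Joe Biden",
                  date(2021, 1, 20), None),
]

# The gold answer to a time-sensitive question depends on the query date:
print(answer_at(events, "United States", "president", date(2020, 6, 1)))  # → Donald Trump
print(answer_at(events, "United States", "president", date(2022, 6, 1)))  # → Joe Biden
```

A benchmark item can then pair one question template with many query dates, so that a model's accuracy can be measured as the date shifts at day-level resolution.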
📝 Abstract
Who is the US President? The answer depends on when the question is asked. While large language models (LLMs) are evaluated on a wide range of reasoning tasks, those evaluations often miss a crucial dimension: time. In real-world scenarios, the correctness of an answer is frequently tied to its temporal context. To address this gap, we present a novel framework and a dataset of over 8,000 events from 2018 to 2024, annotated at day-level granularity and sourced globally across domains such as politics, science, and business. Our TimeShift evaluation method systematically probes LLMs for temporal reasoning, revealing that base models often outperform instruction-tuned and synthetically trained counterparts on time-sensitive recall. We also find that even large-scale models are brittle when handling paraphrased facts, highlighting unresolved challenges in temporal consistency. By identifying these limitations, our work takes a significant step toward time-aware language models that can adapt to the dynamic nature of real-world knowledge.
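The paraphrase-brittleness finding implies a simple consistency metric: ask the same factual question in several phrasings and measure how often the model's answers agree. The sketch below is a hypothetical scoring helper, not the paper's metric; the example answers are invented to show a disagreement.

```python
from collections import Counter

def paraphrase_consistency(answers):
    """Fraction of paraphrases whose (normalized) answer matches the majority answer.

    `answers` maps each paraphrased question to the model's answer string.
    """
    counts = Counter(a.strip().lower() for a in answers.values())
    return counts.most_common(1)[0][1] / len(answers)

# Hypothetical model answers to three paraphrases of the same fact query,
# where one phrasing elicits an inconsistent (stale) answer:
answers = {
    "Who is the US President?": "Joe Biden",
    "Who currently holds the US presidency?": "Joe Biden",
    "Name the sitting US president.": "Donald Trump",
}
print(round(paraphrase_consistency(answers), 3))  # → 0.667
```

A perfectly consistent model scores 1.0 regardless of whether its majority answer is correct, so this metric isolates consistency from factual accuracy.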