UnSeenTimeQA: Time-Sensitive Question-Answering Beyond LLMs' Memorization

📅 2024-07-03
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
This paper addresses a core limitation in evaluating large language models (LLMs) on time-sensitive question answering (TSQA): existing benchmarks conflate temporal reasoning performance with memorized knowledge or web-retrievable facts. To isolate genuine temporal reasoning, the authors introduce UnSeenTimeQA, a data-contamination-free TSQA benchmark built from synthetic event scenarios grounded in invented, non-real-world facts. It poses three categories of time-sensitive questions that cannot be answered from pretraining knowledge or external retrieval, and its structured design exercises long-range event dependencies and parallel timelines. Evaluating five state-of-the-art LLMs shows that while they handle simpler subsets well, accuracy drops markedly relative to real-fact TSQA benchmarks, with errors concentrated in reasoning over long-range dependencies and concurrently unfolding events.

📝 Abstract
This paper introduces UnSeenTimeQA, a novel data contamination-free time-sensitive question-answering (TSQA) benchmark. It differs from existing TSQA benchmarks by avoiding web-searchable queries grounded in the real world. We present a series of time-sensitive event scenarios based on synthetically generated facts. These scenarios require large language models (LLMs) to engage in genuine temporal reasoning without depending on factual knowledge acquired during the pre-training phase. We designed three types of time-sensitive questions to test LLMs' temporal reasoning abilities over sequential and parallel event occurrences. Our evaluation of five LLMs on synthetic fact-based TSQA reveals mixed results: while they perform well on simpler subsets, their overall performance remains inferior compared to real-world fact-based TSQA. Error analysis of LLM-generated reasoning chains indicates that LLMs face difficulties in reasoning over long-range event dependencies and over parallel event timelines that unfold concurrently.
Problem

Research questions and friction points this paper is trying to address.

Creating a contamination-free benchmark for time-sensitive QA
Testing LLMs' temporal reasoning without pre-trained knowledge
Evaluating LLMs' ability to handle long-range and parallel events
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic fact-based time-sensitive event scenarios
On-demand data generation to prevent leakage
Three types of temporal reasoning questions
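The generation idea above can be sketched in a few lines. This is a hypothetical illustration, not the authors' actual pipeline: the fictional agent and task names are invented here to show why such data is contamination-free, since the gold answer is derivable only from the generated timeline, never from pretraining knowledge or web search.

```python
import random

# Invented, non-real-world vocabulary (illustrative assumption, not from the paper).
FICTIONAL_AGENTS = ["Zorvian freighter Kelth-9", "courier drone Yx-Prime"]
FICTIONAL_TASKS = ["loading vexite ore", "recalibrating its flux array",
                   "docking at station Olm-Delta"]

def generate_scenario(rng, n_events=3):
    """Build a synthetic timeline: one agent performs a sequence of tasks,
    each occupying a contiguous [start, end) interval in abstract time units."""
    agent = rng.choice(FICTIONAL_AGENTS)
    t = 0
    timeline = []
    for task in rng.sample(FICTIONAL_TASKS, n_events):
        duration = rng.randint(2, 5)
        timeline.append({"agent": agent, "task": task,
                         "start": t, "end": t + duration})
        t += duration
    return timeline

def make_question(timeline, query_time):
    """Pose a time-sensitive question whose gold answer is computed purely
    from the synthetic timeline, so it cannot be memorized or retrieved."""
    agent = timeline[0]["agent"]
    question = f"What was the {agent} doing at time {query_time}?"
    answer = next((e["task"] for e in timeline
                   if e["start"] <= query_time < e["end"]), "nothing")
    return question, answer

rng = random.Random(42)
timeline = generate_scenario(rng)
question, answer = make_question(timeline, query_time=timeline[1]["start"])
```

Because scenarios are sampled on demand from a random seed, fresh evaluation instances can be produced at test time, which is one way the "on-demand data generation to prevent leakage" property can be realized.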