TRAVELER: A Benchmark for Evaluating Temporal Reasoning across Vague, Implicit and Explicit References

📅 2025-05-02
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing natural language understanding (NLU) benchmarks inadequately evaluate temporal reasoning, particularly across explicit, implicit, and vague temporal references. Method: We introduce TRAVELER, a synthetic benchmark that systematically assesses all three reference types: explicit, implicit relative to speech time, and vague. Ground-truth answers for vague references are established via crowdsourced human surveys; the benchmark generates synthetic event sets, formulates temporal question-answering tasks over them, and supports zero- and few-shot evaluation across models. Results: On a test set of 3,300 questions, models answer questions with explicit references over short event sequences well, but performance degrades sharply as event sets grow and temporal references become less explicit; all models score lowest on vague questions. Contribution: TRAVELER is, per the authors, the first benchmark to distinguish explicit, implicit, and vague temporal references in its evaluation design, revealing vague temporal reasoning as a critical weakness of current LLMs.
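For illustration, a minimal sketch of the zero-shot QA evaluation loop the summary describes, assuming a JSON release where each item carries a context, question, gold answer, and reference type. The field names and `query_model` are assumptions, not the paper's API; the actual schema is in the linked repository.

```python
import json
from collections import defaultdict

def query_model(prompt: str) -> str:
    """Placeholder: swap in a call to the LLM under evaluation."""
    raise NotImplementedError

def evaluate(path: str) -> dict[str, float]:
    """Exact-match accuracy, grouped by temporal reference type."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    with open(path) as f:
        items = json.load(f)
    for item in items:
        # Field names below are assumptions; consult the TRAVELER
        # repository for the actual dataset schema.
        ref_type = item["reference_type"]  # "explicit" | "implicit" | "vague"
        prompt = f'{item["context"]}\n{item["question"]}'
        prediction = query_model(prompt)
        total[ref_type] += 1
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct[ref_type] += 1
    return {t: correct[t] / total[t] for t in total}
```

Grouping accuracy by reference type is what would surface the paper's headline finding: a gap between explicit and vague questions.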

📝 Abstract
Understanding and resolving temporal references is essential in Natural Language Understanding as we often refer to the past or future in daily communication. Although existing benchmarks address a system's ability to reason about and resolve temporal references, systematic evaluation of specific temporal references remains limited. Towards closing this gap, we introduce TRAVELER, a novel synthetic benchmark dataset that follows a Question Answering paradigm and consists of questions involving temporal references paired with the corresponding correct answers. TRAVELER assesses models' abilities to resolve explicit, implicit relative to speech time, and vague temporal references. Beyond investigating the performance of state-of-the-art LLMs depending on the type of temporal reference, our benchmark also allows evaluation of performance in relation to the length of the set of events. For the category of vague temporal references, ground-truth answers were established via human surveys on Prolific, following a procedure similar to the one from Kenneweg et al. To demonstrate the benchmark's applicability, we evaluate four state-of-the-art LLMs using a question-answering task encompassing 3,300 questions. Our findings show that while the benchmarked LLMs can answer questions over event sets with a handful of events and explicit temporal references successfully, performance clearly deteriorates as event set length grows and temporal references become less explicit. Notably, the vague question category exhibits the lowest performance across all models. The benchmark is publicly available at: https://gitlab.ub.uni-bielefeld.de/s.kenneweg/TRAVELER
Problem

Research questions and friction points this paper is trying to address.

Evaluating temporal reasoning over vague, implicit, and explicit references
Assessing model performance across varying event set lengths
Benchmarking LLMs on resolving diverse temporal reference types
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic benchmark dataset for temporal reasoning
Evaluates explicit, implicit, and vague references (illustrated below)
Assesses performance across varying event set lengths
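To make the three reference types concrete, here are hypothetical questions in the style the paper describes; these are illustrative examples, not items drawn from the TRAVELER dataset.

```python
# Hypothetical illustrations of the three reference types; not dataset items.
examples = [
    {"reference_type": "explicit",  # fully specified point in time
     "question": "What did I buy on May 2nd, 2024?"},
    {"reference_type": "implicit",  # resolved relative to speech time
     "question": "What did I buy three days ago?"},
    {"reference_type": "vague",     # resolution depends on human judgment
     "question": "What did I buy recently?"},
]
```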