TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models exhibit significant deficiencies in real-world temporal reasoning—particularly in handling dense chronological information, dynamic event evolution, and social temporal dependencies. To address this, we introduce TIME, the first comprehensive, multi-level temporal reasoning benchmark grounded in real-world scenarios, comprising 38,522 question-answer pairs, three curated subsets, and 11 fine-grained tasks. We propose a hierarchical evaluation framework and release TIME-Lite, a lightweight human-annotated subset. Our work is the first to empirically characterize the impact of test-time scaling on temporal reasoning performance. TIME is constructed from diverse, authentic textual sources using rigorous QA paradigms, enabling cross-scenario performance attribution analysis. Extensive evaluation reveals consistent weaknesses of mainstream models on dynamic and socially embedded temporal tasks. All code and datasets are publicly available on Hugging Face to foster reproducible research in temporal reasoning.

📝 Abstract
Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing works neglect the real-world challenges of temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose TIME, a multi-level benchmark designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. The benchmark encompasses 3 sub-datasets reflecting different real-world challenges: TIME-Wiki, TIME-News, and TIME-Dial. We conduct extensive experiments on both reasoning and non-reasoning models, perform an in-depth analysis of temporal reasoning performance across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset, to foster future research and standardized evaluation in temporal reasoning. The code is available at https://github.com/sylvain-wei/TIME, and the dataset is available at https://huggingface.co/datasets/SylvainWei/TIME.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' temporal reasoning in real-world scenarios
Addressing intensive, dynamic, and complex temporal dependencies
Providing a multi-level benchmark for diverse temporal tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-level benchmark TIME for temporal reasoning
38,522 QA pairs covering 3 levels
Includes TIME-Wiki, TIME-News, TIME-Dial datasets
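The cross-scenario attribution analysis described above can be illustrated with a minimal evaluation sketch. Only the sub-dataset names (TIME-Wiki, TIME-News, TIME-Dial) and the 3-level structure come from the paper; the record fields and example entries below are hypothetical, not the benchmark's actual schema:

```python
from collections import defaultdict

# Hypothetical QA evaluation records in the spirit of TIME's structure:
# each record names its sub-dataset, its reasoning level (1-3), and
# whether the model under evaluation answered correctly.
records = [
    {"subset": "TIME-Wiki", "level": 1, "correct": True},
    {"subset": "TIME-Wiki", "level": 2, "correct": False},
    {"subset": "TIME-News", "level": 2, "correct": True},
    {"subset": "TIME-News", "level": 3, "correct": False},
    {"subset": "TIME-Dial", "level": 3, "correct": True},
    {"subset": "TIME-Dial", "level": 3, "correct": False},
]

def accuracy_by(records, key):
    """Aggregate accuracy along one axis (e.g. sub-dataset or level)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["correct"])
    return {k: hits[k] / totals[k] for k in totals}

print(accuracy_by(records, "subset"))  # per-scenario attribution
print(accuracy_by(records, "level"))   # per-level attribution
```

Slicing the same records along both axes is what enables the paper's per-scenario and per-level comparisons from a single evaluation run.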
Shaohang Wei
State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Wei Li
State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Feifan Song
Peking University
Wen Luo
Peking University
Tianyi Zhuang
Huawei Noah’s Ark Lab
Haochen Tan
City University of Hong Kong
Zhijiang Guo
HKUST (GZ) | HKUST
Houfeng Wang
State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University