A Study into Investigating Temporal Robustness of LLMs

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the insufficient temporal robustness of large language models (LLMs) on temporal reasoning and temporal factual knowledge tasks. To this end, we systematically construct the first multidimensional temporal robustness evaluation framework, covering eight challenging categories including temporal reconstruction, granularity transformation, and directional inference. Under a zero-shot setting, we evaluate six mainstream LLMs and propose three novel techniques: temporal-aware prompt engineering, temporally equivalent question reformulation, and fine-grained temporal reference comparative analysis. Furthermore, we implement a user-query-driven real-time automatic temporal robustness classifier. Experimental results reveal pervasive deficiencies in LLMs' temporal semantic modeling. Leveraging insights from our evaluation, we improve temporal question answering performance by up to 55%. Our framework provides both diagnostic capability and actionable guidance for enhancing temporal reasoning in LLMs.

📝 Abstract
Large Language Models (LLMs) encapsulate a surprising amount of factual world knowledge. However, their performance on temporal questions and historical knowledge is limited because they often cannot understand temporal scope and orientation or neglect the temporal aspect altogether. In this study, we aim to measure precisely how robust LLMs are for question answering based on their ability to process temporal information and perform tasks requiring temporal reasoning and temporal factual knowledge. Specifically, we design eight time-sensitive robustness tests for factual information to check the sensitivity of six popular LLMs in the zero-shot setting. Overall, we find LLMs lacking temporal robustness, especially to temporal reformulations and the use of different granularities of temporal references. We show how a selection of these eight tests can be used automatically to judge a model's temporal robustness for user questions on the fly. Finally, we apply the findings of this study to improve the temporal QA performance by up to 55 percent.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' robustness in temporal question answering
Evaluating LLMs' sensitivity to temporal references and reasoning
Improving temporal QA performance through targeted robustness tests
Innovation

Methods, ideas, or system contributions that make the work stand out.

Designs eight time-sensitive robustness tests
Automatically judges a model's temporal robustness for user questions on the fly
Improves temporal QA performance by up to 55%
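The granularity-sensitivity idea above can be sketched as a small test harness: pose the same factual question at year, month, and day granularity and check whether the model's answer stays consistent. This is an illustrative sketch, not the authors' implementation; `granularity_variants`, `robustness_score`, and the stubbed answers are hypothetical placeholders for a real LLM call.

```python
# Sketch of a granularity-transformation robustness check, in the spirit of
# the paper's eight time-sensitive tests. Function names and prompts are
# illustrative assumptions, not the authors' actual code.

def granularity_variants(subject: str, year: int, month: str, day: int) -> list[str]:
    """The same factual question posed at year, month, and day granularity."""
    return [
        f"Who held the office of {subject} in {year}?",
        f"Who held the office of {subject} in {month} {year}?",
        f"Who held the office of {subject} on {day} {month} {year}?",
    ]

def robustness_score(answers: list[str], gold: str) -> float:
    """Fraction of granularity variants answered correctly; a temporally
    robust model should score 1.0 regardless of how the date is phrased."""
    hits = sum(gold.lower() in a.lower() for a in answers)
    return hits / len(answers)

# Example with stubbed model answers (a real run would query the LLM once
# per variant); here the model fails only at day granularity.
questions = granularity_variants("German chancellor", 2015, "July", 1)
answers = ["Angela Merkel", "Angela Merkel", "Olaf Scholz"]
score = robustness_score(answers, gold="Angela Merkel")  # 2/3, i.e. not robust
```

Averaging such scores over a question set gives a per-model robustness estimate, and running the variants for a single user query is one way to judge robustness "on the fly" as the abstract describes.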