🤖 AI Summary
This work addresses the limited capability of large language models (LLMs) to distinguish and recall temporal knowledge (such as legal amendments or scientific discoveries) as opposed to static knowledge (such as mathematical truths). To this end, the authors propose the first evaluation paradigm specifically designed for temporal knowledge evolution. They introduce ChroKnowBench, a multi-domain benchmark spanning three dimensions: domain diversity, fine-grained temporal dependency, and state evolution. Complementing this, they design ChroKnowledge, a sampling-based evaluation framework, and ChroKnowPrompt, a stepwise temporal traversal prompting method. Experimental results reveal two prevalent limitations in current LLMs: temporal boundary truncation and strong dependence on input data formatting. ChroKnowPrompt consistently improves cross-temporal knowledge recall across both open-source and proprietary LLMs, demonstrating its generality and effectiveness. The contributions include: (1) the first dedicated benchmark and evaluation framework for temporal knowledge evolution; (2) a novel prompting strategy that enhances temporal reasoning; and (3) empirical insights into fundamental temporal reasoning bottlenecks in LLMs.
📝 Abstract
Large language models (LLMs) have brought significant changes to many aspects of our lives. However, assessing and ensuring their chronological knowledge remains challenging. Existing approaches fall short in addressing the temporal adaptability of knowledge, often relying on a fixed time-point view. To overcome this, we introduce ChroKnowBench, a benchmark dataset designed to evaluate chronologically accumulated knowledge across three key aspects: multiple domains, time dependency, and temporal state. Our benchmark distinguishes between knowledge that evolves (e.g., personal history, scientific discoveries, amended laws) and knowledge that remains constant (e.g., mathematical truths, commonsense facts). Building on this benchmark, we present ChroKnowledge (Chronological Categorization of Knowledge), a novel sampling-based framework for evaluating LLMs' non-parametric chronological knowledge. Our evaluation led to the following observations: (1) the ability to elicit temporal knowledge varies depending on the data format the model was trained on; (2) LLMs partially recall knowledge or show a cut-off at temporal boundaries rather than recalling all aspects of knowledge correctly. We therefore apply ChroKnowPrompt, an in-depth prompting method that elicits chronological knowledge by traversing step-by-step through the surrounding time spans. We observe that it successfully recalls objects across both open-source and proprietary LLMs, demonstrating versatility, though it faces challenges with dynamic datasets and unstructured formats.
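The step-by-step traversal over surrounding time spans can be sketched in code. This is an illustrative sketch under assumptions, not the paper's implementation: the function names (`traversal_order`, `build_prompt`), parameter names, and the exact prompt layout are hypothetical, chosen only to show the idea of visiting a target year's neighbors and assembling the recalled (year, object) pairs into a chronological prompt.

```python
def traversal_order(target_year, span_past=3, span_future=3):
    """Return a plausible visiting order for a stepwise temporal traversal:
    the target year first, then alternately the nearest past and future
    years, moving outward. (Hypothetical ordering for illustration.)"""
    order = [target_year]
    for offset in range(1, max(span_past, span_future) + 1):
        if offset <= span_past:
            order.append(target_year - offset)
        if offset <= span_future:
            order.append(target_year + offset)
    return order


def build_prompt(subject, relation, year_answers, target_year):
    """Compose a chronological prompt: (year, object) pairs already
    recalled for surrounding years are listed in time order, and the
    model is asked to fill in the object for the target year."""
    lines = [f"Subject: {subject}", f"Relation: {relation}"]
    for year in sorted(year_answers):
        lines.append(f"{year}: {year_answers[year]}")
    lines.append(f"{target_year}: ?")
    return "\n".join(lines)


# Example: ask about 2020, traversing two years into the past and future.
print(traversal_order(2020, span_past=2, span_future=2))
print(build_prompt("Lionel Messi", "member of sports team",
                   {2019: "FC Barcelona", 2021: "Paris Saint-Germain"},
                   2020))
```

In a real setting, each visited year's prompt would be sent to the LLM, and any answer it returns would be added to `year_answers` before moving one step further out, so that nearby confirmed facts anchor the recall of the harder target year.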