🤖 AI Summary
Large language models (LLMs) exhibit significant weaknesses in resolving factual knowledge conflicts that arise as knowledge evolves over time. Existing benchmarks rely predominantly on static knowledge bases of popular entities (e.g., Wikidata), and so cannot fairly evaluate models with different knowledge cutoff dates. To address this, the authors propose evolveQA, the first dynamic knowledge question-answering benchmark built on real-world temporal data from AWS, Azure, and WHO. evolveQA uniquely focuses on naturally occurring knowledge evolution: it mines temporal change paths from chronologically annotated corpora, constructs multi-timestamp QA pairs via template-based generation with rigorous human validation, and introduces a multi-format knowledge probing evaluation protocol. Extensive experiments on 12 state-of-the-art open- and closed-source LLMs reveal performance drops of up to 31% on evolveQA compared to static QA tasks, starkly exposing LLMs' fundamental limitations in modeling dynamic, time-sensitive knowledge.
📝 Abstract
LLMs often fail to handle temporal knowledge conflicts: contradictions arising when facts evolve over time within their training data. Existing studies evaluate this phenomenon through benchmarks built on structured knowledge bases like Wikidata, but they focus on widely covered, easily memorized popular entities and lack the dynamic structure needed to fairly evaluate LLMs with different knowledge cut-off dates. We introduce evolveQA, a benchmark specifically designed to evaluate LLMs on temporally evolving knowledge, constructed from 3 real-world, time-stamped corpora: AWS updates, Azure changes, and WHO disease outbreak reports. Our framework identifies naturally occurring knowledge evolution and generates questions with gold answers tailored to different LLM knowledge cut-off dates. Through extensive evaluation of 12 open and closed-source LLMs across 3 knowledge probing formats, we demonstrate significant performance drops of up to 31% on evolveQA compared to static knowledge questions.