When Facts Change: Probing LLMs on Evolving Knowledge with evolveQA

📅 2025-10-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit significant weaknesses in resolving factual knowledge conflicts that arise as facts evolve over time. Existing benchmarks predominantly rely on static knowledge bases of popular entities (e.g., Wikidata) and cannot fairly evaluate models with different knowledge cutoff dates. To address this, we propose evolveQA, a dynamic knowledge question-answering benchmark built on real-world temporal data from AWS, Azure, and WHO. evolveQA focuses on naturally occurring knowledge evolution: it mines temporal change paths from chronologically annotated corpora, constructs multi-timestamp QA pairs via template-based generation with rigorous human validation, and introduces a multi-format knowledge probing evaluation protocol. Extensive experiments across 12 state-of-the-art open- and closed-source LLMs reveal performance drops of up to 31% on evolveQA compared to static QA tasks, exposing LLMs' limitations in modeling dynamic, time-sensitive knowledge.

๐Ÿ“ Abstract
LLMs often fail to handle temporal knowledge conflicts--contradictions arising when facts evolve over time within their training data. Existing studies evaluate this phenomenon through benchmarks built on structured knowledge bases like Wikidata, but they focus on widely-covered, easily-memorized popular entities and lack the dynamic structure needed to fairly evaluate LLMs with different knowledge cut-off dates. We introduce evolveQA, a benchmark specifically designed to evaluate LLMs on temporally evolving knowledge, constructed from 3 real-world, time-stamped corpora: AWS updates, Azure changes, and WHO disease outbreak reports. Our framework identifies naturally occurring knowledge evolution and generates questions with gold answers tailored to different LLM knowledge cut-off dates. Through extensive evaluation of 12 open and closed-source LLMs across 3 knowledge probing formats, we demonstrate significant performance drops of up to 31% on evolveQA compared to static knowledge questions.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on temporal knowledge conflicts
Assessing performance with evolving real-world facts
Measuring accuracy drops across different knowledge cutoffs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses real-world time-stamped corpora for benchmarks
Generates questions tailored to different LLM knowledge cut-offs
Evaluates LLMs across three knowledge probing formats
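The cutoff-tailored gold answers described above can be illustrated with a minimal sketch. This is not the paper's code: the timeline entries and function names are hypothetical, in the spirit of the AWS update corpus, and show how the correct answer to the same question depends on a model's knowledge cutoff date.

```python
from datetime import date

# Hypothetical timeline of fact versions for one evolving fact
# (effective_date, answer) -- toy values, not from the benchmark.
timeline = [
    (date(2022, 3, 1), "3 Availability Zones"),
    (date(2023, 7, 15), "4 Availability Zones"),
    (date(2024, 11, 2), "5 Availability Zones"),
]

def gold_answer(timeline, cutoff):
    """Return the most recent answer effective on or before the model's cutoff."""
    valid = [ans for eff, ans in timeline if eff <= cutoff]
    return valid[-1] if valid else None

print(gold_answer(timeline, date(2024, 1, 1)))  # -> 4 Availability Zones
print(gold_answer(timeline, date(2025, 1, 1)))  # -> 5 Availability Zones
```

Two models with different cutoffs are thus graded against different gold answers for the same question, which is what lets the benchmark compare LLMs fairly across release dates.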