When Facts Change: Probing LLMs on Evolving Knowledge with evolveQA

📅 2025-10-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit significant weaknesses in resolving factual knowledge conflicts that arise as facts evolve over time. Existing benchmarks predominantly rely on static knowledge bases of popular entities (e.g., Wikidata) and cannot fairly evaluate models with different knowledge cutoff dates. To address this, we propose evolveQA, a dynamic knowledge question-answering benchmark built on real-world temporal data from AWS, Azure, and WHO. evolveQA focuses on naturally occurring knowledge evolution: it mines temporal change paths from chronologically annotated corpora, constructs multi-timestamp QA pairs via template-based generation with rigorous human validation, and introduces a multi-format knowledge probing evaluation protocol. Extensive experiments across 12 state-of-the-art open- and closed-source LLMs reveal performance drops of up to 31% on evolveQA compared to static QA tasks, exposing LLMs' limitations in modeling dynamic, time-sensitive knowledge.

๐Ÿ“ Abstract
LLMs often fail to handle temporal knowledge conflicts--contradictions arising when facts evolve over time within their training data. Existing studies evaluate this phenomenon through benchmarks built on structured knowledge bases like Wikidata, but they focus on widely-covered, easily-memorized popular entities and lack the dynamic structure needed to fairly evaluate LLMs with different knowledge cut-off dates. We introduce evolveQA, a benchmark specifically designed to evaluate LLMs on temporally evolving knowledge, constructed from 3 real-world, time-stamped corpora: AWS updates, Azure changes, and WHO disease outbreak reports. Our framework identifies naturally occurring knowledge evolution and generates questions with gold answers tailored to different LLM knowledge cut-off dates. Through extensive evaluation of 12 open and closed-source LLMs across 3 knowledge probing formats, we demonstrate significant performance drops of up to 31% on evolveQA compared to static knowledge questions.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on temporal knowledge conflicts
Assessing performance with evolving real-world facts
Measuring accuracy drops across different knowledge cutoffs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses real-world time-stamped corpora for benchmarks
Generates questions tailored to different LLM knowledge cut-offs
Evaluates LLMs across three knowledge probing formats
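The cutoff-tailored gold answers described above can be illustrated with a minimal sketch. This is not the paper's code: the timeline entries and function names are hypothetical, in the spirit of the AWS update corpus, and show how the correct answer to the same question depends on a model's knowledge cutoff date.

```python
from datetime import date

# Hypothetical timeline of fact versions for one evolving fact
# (effective_date, answer) -- toy values, not from the benchmark.
timeline = [
    (date(2022, 3, 1), "3 Availability Zones"),
    (date(2023, 7, 15), "4 Availability Zones"),
    (date(2024, 11, 2), "5 Availability Zones"),
]

def gold_answer(timeline, cutoff):
    """Return the most recent answer effective on or before the model's cutoff."""
    valid = [ans for eff, ans in timeline if eff <= cutoff]
    return valid[-1] if valid else None

print(gold_answer(timeline, date(2024, 1, 1)))  # -> 4 Availability Zones
print(gold_answer(timeline, date(2025, 1, 1)))  # -> 5 Availability Zones
```

Two models with different cutoffs are thus graded against different gold answers for the same question, which is what lets the benchmark compare LLMs fairly across release dates.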