Natural Context Drift Undermines the Natural Language Understanding of Large Language Models

📅 2025-08-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how the natural evolution of context passages affects large language models' (LLMs) question-answering performance. We address the problem that minor, semantically equivalent context perturbations, such as human-edited paragraph variants, induce substantial degradation in model comprehension. To this end, we propose the first evaluation framework to quantify "natural contextual drift": it constructs context variants binned by semantic similarity and conducts systematic experiments across six QA benchmarks and eight state-of-the-art generative LLMs. Results reveal that even when all critical information remains intact, subtle contextual changes significantly impair performance; for example, average accuracy drops by over 30% on BoolQ, with some models exhibiting sensitivity slopes exceeding 70. These findings empirically demonstrate the nonlinear, nontrivial influence of natural contextual drift and expose the fragility of current LLMs' linguistic understanding. Our work establishes a novel paradigm for robustness-aware modeling and evaluation.

📝 Abstract
How does the natural evolution of context paragraphs affect question answering in generative Large Language Models (LLMs)? To investigate this, we propose a framework for curating naturally evolved, human-edited variants of reading passages from contemporary QA benchmarks and for analyzing LLM performance across a range of semantic similarity scores, which quantify how closely each variant aligns with content seen during pretraining. Using this framework, we evaluate six QA datasets and eight LLMs with publicly available training data. Our experiments reveal that LLM performance declines as reading passages naturally diverge from the versions encountered during pretraining, even when the question and all necessary information remain present at inference time. For instance, average model accuracy on BoolQ drops by over 30% from the highest to the lowest similarity bin, with slopes exceeding 70 across several LLMs. These findings suggest that natural text evolution poses a significant challenge to the language understanding capabilities of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Investigating natural context evolution effects on LLM question answering
Measuring performance decline as passages diverge from pretraining content
Evaluating natural text evolution challenge to LLM understanding capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework for curating naturally evolved human-edited passage variants
Analyzing LLM performance across semantic similarity scores
Evaluating six QA datasets and eight LLMs with publicly available training data
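The pipeline the abstract describes (score each passage variant's similarity to its original, group variants into similarity bins, and measure how accuracy falls from high- to low-similarity bins) can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the `similarity` function uses `difflib.SequenceMatcher` as a stand-in for whatever semantic similarity metric the authors actually use, the bin count is arbitrary, and the "sensitivity slope" is modeled here as a plain least-squares slope of accuracy (in percentage points) against similarity.

```python
from difflib import SequenceMatcher

def similarity(original: str, variant: str) -> float:
    """Score in [0, 1] for how closely a variant matches the original
    passage. A surface-level proxy; the paper's actual semantic
    similarity metric is not reproduced here."""
    return SequenceMatcher(None, original, variant).ratio()

def binned_accuracy(records, n_bins=4):
    """Group (similarity, correct) records into equal-width similarity
    bins and return per-bin accuracy, lowest bin first."""
    bins = [[] for _ in range(n_bins)]
    for sim, correct in records:
        idx = min(int(sim * n_bins), n_bins - 1)  # clamp sim == 1.0
        bins[idx].append(correct)
    return [sum(b) / len(b) if b else None for b in bins]

def sensitivity_slope(records):
    """Least-squares slope of accuracy (percentage points) against
    similarity; a large positive slope means the model depends heavily
    on near-pretraining wording."""
    xs = [s for s, _ in records]
    ys = [100.0 * c for _, c in records]
    n = len(records)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Toy records: (similarity to the pretraining-era passage, answered correctly?)
records = [(0.95, 1), (0.90, 1), (0.70, 1), (0.60, 0), (0.40, 0), (0.20, 0)]
```

On this toy data, accuracy rises monotonically across the similarity bins and the fitted slope is strongly positive, which is the qualitative pattern the paper reports on BoolQ.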