🤖 AI Summary
This paper identifies a pervasive “lost-in-the-later” phenomenon in large language models (LLMs) during open-ended question answering: LLMs systematically underutilize information positioned later in long contexts, leading to factual inconsistency and hallucination, an effect that reasoning models and chain-of-thought prompting fail to mitigate. To address this, the authors propose CoPE, a comprehensive evaluation framework, and introduce MultiWikiAtomic, a multilingual benchmark dataset designed to systematically quantify contextual-knowledge utilization. They provide the first empirical evidence and quantitative measurement of this positional bias, and further propose a contextual-knowledge-informed (CK-informed) prompting method that improves factual accuracy and reduces hallucination in summarization tasks. The work establishes a paradigm for jointly modeling contextual and parametric knowledge and delivers a reproducible, rigorous evaluation benchmark for future research.
📝 Abstract
Large language models (LLMs) are capable of leveraging both contextual and parametric knowledge, but how they prioritize and integrate these sources remains underexplored. We introduce CoPE, a novel evaluation framework that systematically measures contextual knowledge (CK) and parametric knowledge (PK) across models and languages. Using our MultiWikiAtomic dataset in English, Spanish, and Danish, we analyze how LLMs integrate context, prioritize information, and incorporate PK in open-ended question answering. Our analysis uncovers a phenomenon we call lost-in-the-later, where LLMs tend to overlook or deprioritize information that appears later in a given context, revealing a strong positional bias that affects contextual grounding. We further find that reasoning models, as well as non-reasoning models prompted with chain-of-thought (CoT), use context even less than non-reasoning models without CoT and fail to mitigate the lost-in-the-later effect. CoT prompting, in particular, results in lower recall and shorter responses, leading to degraded contextual grounding. Based on these insights, we design prompt-based methods to effectively leverage input context. A case study applying CoPE to summarization demonstrates that CK-informed prompting improves factual grounding and reduces hallucination.
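The positional measurement the abstract describes can be illustrated with a minimal sketch: split the context into sentences, check which ones a model's answer actually draws on, and bucket the utilization rate by position (early/middle/late). This is not the paper's CoPE implementation — CoPE matches atomic facts, whereas this uses a crude token-overlap heuristic, and all function names and data here are illustrative assumptions.

```python
# Illustrative sketch (NOT the paper's CoPE code): estimate how much of a
# context an answer uses, bucketed by where the information appears.
# "Used" is approximated by content-token overlap; CoPE matches atomic facts.

def token_overlap(sentence: str, answer: str, threshold: float = 0.4) -> bool:
    """Treat a context sentence as 'used' if enough of its content tokens appear in the answer."""
    sent_tokens = {t.lower().strip(".,") for t in sentence.split() if len(t) > 3}
    if not sent_tokens:
        return False
    ans_tokens = {t.lower().strip(".,") for t in answer.split()}
    return len(sent_tokens & ans_tokens) / len(sent_tokens) >= threshold

def positional_utilization(context_sentences: list[str], answer: str, n_buckets: int = 3) -> list[float]:
    """Fraction of context sentences judged 'used', per positional bucket."""
    counts = [0] * n_buckets
    used = [0] * n_buckets
    for i, sent in enumerate(context_sentences):
        bucket = min(i * n_buckets // len(context_sentences), n_buckets - 1)
        counts[bucket] += 1
        if token_overlap(sent, answer):
            used[bucket] += 1
    return [u / c if c else 0.0 for u, c in zip(used, counts)]

# Toy example: the answer reuses the early and middle facts but drops the late one.
context = [
    "The bridge opened in 1932 after eight years of construction.",
    "It spans roughly 1,149 metres across the harbour.",
    "A major refurbishment added pedestrian walkways in 1998.",
]
answer = "The bridge opened in 1932 and spans about 1,149 metres across the harbour."
print(positional_utilization(context, answer))  # → [1.0, 1.0, 0.0]
```

A declining utilization curve across buckets, aggregated over many question-answer pairs, would be the lost-in-the-later signature the paper quantifies.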