Training Dynamics of Parametric and In-Context Knowledge Utilization in Language Models

📅 2025-09-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) face a critical knowledge arbitration challenge in retrieval-augmented generation (RAG): reconciling parametric knowledge with contextual knowledge retrieved from external sources. Blindly incorporating retrieved content risks noise contamination, while over-relying on parametric knowledge undermines the utility of retrieval. To address this, we construct a controllable synthetic biographical corpus and conduct the first controlled training study to systematically uncover the mechanisms underlying the formation of knowledge arbitration strategies. Our findings reveal that moderate information inconsistency and distributional bias in the training corpus, far from being defects to eliminate, are instrumental in fostering robust arbitration capabilities. Moreover, intra-document factual repetition significantly enhances the synergistic utilization of parametric and contextual knowledge. This work establishes an empirical foundation for adaptive knowledge integration in LLMs and introduces a data construction paradigm grounded in controlled synthesis and deliberate corpus design.

📝 Abstract
Large language models often encounter conflicts between in-context knowledge retrieved at inference time and parametric knowledge acquired during pretraining. Models that accept external knowledge uncritically are vulnerable to misinformation, whereas models that adhere rigidly to parametric knowledge fail to benefit from retrieval. Despite the widespread adoption of retrieval-augmented generation, we still lack a systematic understanding of what shapes knowledge-arbitration strategies during training. This gap risks producing pretrained models with undesirable arbitration behaviors and, consequently, wasting substantial computational resources after the pretraining budget has already been spent. To address this problem, we present the first controlled study of how training conditions influence models' use of in-context and parametric knowledge, and how they arbitrate between them. We train transformer-based language models on a synthetic biographies corpus while systematically controlling various conditions. Our experiments reveal that intra-document repetition of facts fosters the development of both parametric and in-context capabilities. Moreover, training on a corpus that contains inconsistent information or distributional skew encourages models to develop robust strategies for leveraging parametric and in-context knowledge. Rather than viewing these non-ideal properties as artifacts to remove, our results indicate that they are important for learning robust arbitration. These insights offer concrete, empirical guidance for pretraining models that harmoniously integrate parametric and in-context knowledge.
Problem

Research questions and friction points this paper is trying to address.

Investigating training dynamics of knowledge arbitration strategies
Understanding conflicts between parametric and in-context knowledge utilization
Developing robust models that integrate both knowledge types effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training transformer models on synthetic biographies corpus
Using intra-document fact repetition to develop knowledge capabilities
Leveraging inconsistent information for robust arbitration strategies
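The data construction idea behind these bullets can be sketched concretely. The snippet below is a minimal, hypothetical illustration (not the paper's actual pipeline): it generates one synthetic biography document in which each fact is stated a controllable number of times (intra-document repetition) and restatements occasionally contradict the ground truth (information inconsistency). All names, attributes, and templates are invented for illustration.

```python
import random

# Hypothetical attribute schema and value pools for synthetic biographies.
ATTRIBUTES = ["birth_year", "birth_city", "university", "employer"]
CITIES = ["Lisbon", "Oslo", "Kyoto", "Quito"]
UNIVERSITIES = ["Northgate University", "Harrow Institute"]
EMPLOYERS = ["Vextra Labs", "Orion Analytics"]

def sample_profile(name, rng):
    """Sample a ground-truth fact set for one synthetic person."""
    return {
        "name": name,
        "birth_year": rng.randint(1940, 2000),
        "birth_city": rng.choice(CITIES),
        "university": rng.choice(UNIVERSITIES),
        "employer": rng.choice(EMPLOYERS),
    }

def render_fact(profile, attr):
    """Render a single attribute of a profile as a natural-language sentence."""
    templates = {
        "birth_year": f"{profile['name']} was born in {profile['birth_year']}.",
        "birth_city": f"{profile['name']} was born in {profile['birth_city']}.",
        "university": f"{profile['name']} studied at {profile['university']}.",
        "employer": f"{profile['name']} works for {profile['employer']}.",
    }
    return templates[attr]

def make_document(profile, repetitions=2, inconsistency_rate=0.1, rng=None):
    """Build one biography: each fact stated `repetitions` times, with a small
    chance that a restatement contradicts the ground-truth value."""
    rng = rng or random.Random()
    sentences = []
    for attr in ATTRIBUTES:
        for rep in range(repetitions):
            p = dict(profile)
            # After the first mention, optionally inject a conflicting value.
            if rep > 0 and rng.random() < inconsistency_rate:
                if attr == "birth_year":
                    p["birth_year"] = profile["birth_year"] + rng.choice([-1, 1])
                elif attr == "birth_city":
                    p["birth_city"] = rng.choice(
                        [c for c in CITIES if c != profile["birth_city"]])
                # Other attributes are left consistent in this sketch.
            sentences.append(render_fact(p, attr))
    rng.shuffle(sentences)  # vary fact order across the document
    return " ".join(sentences)

rng = random.Random(0)
profile = sample_profile("Ada Kern", rng)
doc = make_document(profile, repetitions=3, inconsistency_rate=0.2, rng=rng)
print(doc)
```

Sweeping `repetitions` and `inconsistency_rate` (and, analogously, skewing how often each person appears in the corpus) is the kind of controlled variation the study uses to probe how arbitration strategies form.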