Understanding Fact Recall in Language Models: Why Two-Stage Training Encourages Memorization but Mixed Training Teaches Knowledge

📅 2025-05-22

📈 Citations: 0

✨ Influential: 0

career value

199K/year

🤖 AI Summary

Language models’ factual recall is constrained by training paradigms: two-stage training often induces rote memorization, while mixed training improves performance but its underlying mechanism remains unclear. Method: We propose cross-task gradient tracing to systematically analyze parameter dynamics of Llama-3.2B and Pythia-2.8B on a synthetic factual dataset. Contribution/Results: We find that mixed training fosters larger, more concentrated shared parameter clusters—indicating that knowledge generalization arises from coordinated parameter optimization rather than localized memory strengthening. Empirically, this mechanism enhances robustness and cross-task transferability of factual recall. To our knowledge, this is the first study to elucidate how mixed training improves factual retrieval capability from an emergent parameter perspective, offering a novel pathway toward interpretable and generalizable knowledge modeling.

Technology Category

Application Category

📝 Abstract

Fact recall, the ability of language models (LMs) to retrieve specific factual knowledge, remains a challenging task despite their impressive general capabilities. Common training strategies often struggle to promote robust recall behavior with two-stage training, which first trains a model with fact-storing examples (e.g., factual statements) and then with fact-recalling examples (question-answer pairs), tending to encourage rote memorization rather than generalizable fact retrieval. In contrast, mixed training, which jointly uses both types of examples, has been empirically shown to improve the ability to recall facts, but the underlying mechanisms are still poorly understood. In this work, we investigate how these training strategies affect how model parameters are shaped during training and how these differences relate to their ability to recall facts. We introduce cross-task gradient trace to identify shared parameters, those strongly influenced by both fact-storing and fact-recalling examples. Our analysis on synthetic fact recall datasets with the Llama-3.2B and Pythia-2.8B models reveals that mixed training encouraging a larger and more centralized set of shared parameters. These findings suggest that the emergence of parameters may play a key role in enabling LMs to generalize factual knowledge across task formulations.

Problem

Research questions and friction points this paper is trying to address.

Investigates how training strategies affect fact recall in language models

Compares two-stage vs mixed training for memorization vs knowledge generalization

Analyzes shared parameters' role in improving factual knowledge retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage training promotes rote memorization

Mixed training improves fact recall ability

Cross-task gradient trace identifies shared parameters

🔎 Similar Papers

Co-occurrence is not Factual Association in Language Models