Understanding Fact Recall in Language Models: Why Two-Stage Training Encourages Memorization but Mixed Training Teaches Knowledge

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Language models’ factual recall is constrained by training paradigms: two-stage training often induces rote memorization, while mixed training improves performance through mechanisms that remain unclear. Method: We propose cross-task gradient tracing to systematically analyze the parameter dynamics of the Llama-3.2B and Pythia-2.8B models on a synthetic factual dataset. Contribution/Results: We find that mixed training fosters larger, more concentrated clusters of shared parameters, indicating that knowledge generalization arises from coordinated parameter optimization rather than localized memory strengthening. Empirically, this mechanism enhances the robustness and cross-task transferability of factual recall. To our knowledge, this is the first study to explain how mixed training improves factual retrieval from the perspective of emergent shared parameters, offering a pathway toward interpretable and generalizable knowledge modeling.
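
The "cross-task gradient tracing" named in the summary can be pictured as accumulating per-parameter gradient influence separately for each example type. Below is a minimal PyTorch sketch of that idea; the attribution rule (mean absolute gradient per parameter) and the `loss_fn` hook are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def trace_gradients(model, batches, loss_fn):
    """Accumulate mean absolute gradient per parameter over one task's batches."""
    influence = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    n = 0
    for batch in batches:
        model.zero_grad()
        loss_fn(model, batch).backward()  # e.g., cross-entropy LM loss on the batch
        for name, p in model.named_parameters():
            if p.grad is not None:
                influence[name] += p.grad.abs()
        n += 1
    return {name: g / max(n, 1) for name, g in influence.items()}

# Run the trace once per example type to compare their gradient footprints:
# store_trace  = trace_gradients(model, fact_storing_batches, loss_fn)
# recall_trace = trace_gradients(model, fact_recalling_batches, loss_fn)
```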

📝 Abstract
Fact recall, the ability of language models (LMs) to retrieve specific factual knowledge, remains a challenging task despite their impressive general capabilities. Common training strategies often struggle to promote robust recall behavior: two-stage training, which first trains a model with fact-storing examples (e.g., factual statements) and then with fact-recalling examples (question-answer pairs), tends to encourage rote memorization rather than generalizable fact retrieval. In contrast, mixed training, which jointly uses both types of examples, has been empirically shown to improve the ability to recall facts, but the underlying mechanisms are still poorly understood. In this work, we investigate how these training strategies shape model parameters during training and how these differences relate to the models' ability to recall facts. We introduce cross-task gradient trace to identify shared parameters, those strongly influenced by both fact-storing and fact-recalling examples. Our analysis on synthetic fact recall datasets with the Llama-3.2B and Pythia-2.8B models reveals that mixed training encourages a larger and more centralized set of shared parameters. These findings suggest that the emergence of shared parameters may play a key role in enabling LMs to generalize factual knowledge across task formulations.
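
To make the two strategies concrete, here is a minimal sketch of how their training loops differ; `store_examples`, `recall_examples`, and `train_step` are hypothetical placeholders standing in for the paper's synthetic data and a single optimizer update.

```python
import random

def two_stage_training(model, store_examples, recall_examples, train_step):
    # Stage 1: fact-storing examples (factual statements) only,
    # then Stage 2: fact-recalling examples (question-answer pairs) only.
    for ex in store_examples:
        train_step(model, ex)
    for ex in recall_examples:
        train_step(model, ex)

def mixed_training(model, store_examples, recall_examples, train_step, seed=0):
    # Both example types are shuffled into a single training stream.
    mixed = list(store_examples) + list(recall_examples)
    random.Random(seed).shuffle(mixed)
    for ex in mixed:
        train_step(model, ex)
```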
Problem

Research questions and friction points this paper is trying to address.

Investigates how training strategies affect fact recall in language models
Compares two-stage vs mixed training for memorization vs knowledge generalization
Analyzes shared parameters' role in improving factual knowledge retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage training promotes rote memorization
Mixed training improves fact recall ability
Cross-task gradient trace identifies shared parameters
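
One way to realize the shared-parameter idea, given two gradient traces like those sketched earlier, is to intersect the most strongly influenced parameters from each task. The top-fraction threshold below is an assumption for illustration; the paper may rank or cluster parameters differently.

```python
import torch

def shared_parameter_mask(trace_a, trace_b, top_fraction=0.01):
    """Mark parameters in the top fraction of gradient influence for BOTH
    fact-storing (trace_a) and fact-recalling (trace_b) examples."""
    masks = {}
    for name in trace_a:
        a = trace_a[name].flatten()
        b = trace_b[name].flatten()
        k = max(1, int(top_fraction * a.numel()))
        thr_a = a.topk(k).values.min()  # influence cutoff for task A
        thr_b = b.topk(k).values.min()  # influence cutoff for task B
        masks[name] = ((a >= thr_a) & (b >= thr_b)).view_as(trace_a[name])
    return masks
```

The fraction of entries set in these masks, computed per layer, gives a simple proxy for how large and how concentrated the shared-parameter set is under each training strategy.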
👥 Authors
Ying Zhang
RIKEN Center for Advanced Intelligence Project
Benjamin Heinzerling
RIKEN AIP / Tohoku University
Natural Language Processing, Computational Linguistics
Dongyuan Li
The University of Tokyo
Ryoma Ishigaki
Tokyo Denki University, Alt Inc
Yuta Hitomi
Alt Inc
Kentaro Inui
MBZUAI, Tohoku University, RIKEN
natural language processing, computational linguistics, LLM/LMM interpretability