🤖 AI Summary
Language models’ factual recall is constrained by training paradigms: two-stage training often induces rote memorization, while mixed training improves performance but its underlying mechanism remains unclear.
Method: We propose cross-task gradient tracing to systematically analyze parameter dynamics of Llama-3.2B and Pythia-2.8B on a synthetic factual dataset.
Contribution/Results: We find that mixed training fosters larger, more concentrated shared parameter clusters—indicating that knowledge generalization arises from coordinated parameter optimization rather than localized memory strengthening. Empirically, this mechanism enhances robustness and cross-task transferability of factual recall. To our knowledge, this is the first study to elucidate how mixed training improves factual retrieval capability from an emergent parameter perspective, offering a novel pathway toward interpretable and generalizable knowledge modeling.
📝 Abstract
Fact recall, the ability of language models (LMs) to retrieve specific factual knowledge, remains a challenging task despite their impressive general capabilities. Common training strategies often struggle to promote robust recall behavior with two-stage training, which first trains a model with fact-storing examples (e.g., factual statements) and then with fact-recalling examples (question-answer pairs), tending to encourage rote memorization rather than generalizable fact retrieval. In contrast, mixed training, which jointly uses both types of examples, has been empirically shown to improve the ability to recall facts, but the underlying mechanisms are still poorly understood. In this work, we investigate how these training strategies affect how model parameters are shaped during training and how these differences relate to their ability to recall facts. We introduce cross-task gradient trace to identify shared parameters, those strongly influenced by both fact-storing and fact-recalling examples. Our analysis on synthetic fact recall datasets with the Llama-3.2B and Pythia-2.8B models reveals that mixed training encouraging a larger and more centralized set of shared parameters. These findings suggest that the emergence of parameters may play a key role in enabling LMs to generalize factual knowledge across task formulations.