Continual Memorization of Factoids in Language Models

📅 2024-11-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Pretrained language models (LMs) suffer severe catastrophic forgetting when continually acquiring new knowledge—particularly during multi-stage fine-tuning, where factually grounded representations learned in earlier stages rapidly degrade, exacerbating hallucination risks. This work formally defines the continual memorization problem for factoids and proposes REMIX: a simple yet effective method that interleaves each fine-tuning stage with random token sequences or generic pretraining data to mitigate parameter interference. Experiments show that REMIX induces factual knowledge to migrate toward earlier, more distributed layers, making the stored facts easier to recall and manipulate. Across multiple benchmarks, REMIX outperforms mainstream continual learning approaches—including replay-based methods—and restores factual accuracy even after severe forgetting. Furthermore, the analysis suggests that robust factual memory relies on distributing storage across model layers rather than concentrating it in a few.

📝 Abstract
As new knowledge rapidly accumulates, language models (LMs) with pretrained knowledge quickly become obsolete. A common approach to updating LMs is fine-tuning them directly on new knowledge. However, recent studies have shown that fine-tuning for memorization may be ineffective in storing knowledge or may exacerbate hallucinations. In this work, we introduce a setting we call continual memorization, where a model must memorize and retain a set of factoids through multiple stages of fine-tuning on subsequent datasets. We characterize the forgetting patterns through extensive experiments and show that LMs widely suffer from forgetting, especially when needing to memorize factoids in the second stage. We posit that forgetting can be alleviated by modifying training dynamics: (1) protecting the memorization process when learning factoids or (2) reducing interference from subsequent training stages. Intriguingly, we find that mixing randomly generated word sequences or generic data sampled from pretraining corpora at different training stages effectively mitigates forgetting (REMIX: Random and Generic Data Mixing). REMIX can recover performance from severe forgetting, outperforming replay methods and other continual learning baselines. We analyze how REMIX influences the learning process and find that robust memorization follows a distinct pattern: the model stores factoids in earlier layers than usual and diversifies the layers that retain them, which results in easier recall and manipulation of the learned factoids.
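The mixing step described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function names, the 1:1 mix ratio, and the toy vocabulary are assumptions, and the paper's actual mixing proportions and data sources may differ.

```python
import random

def make_random_word_sequence(vocab, length=32):
    """Sample a random word sequence (one of the two REMIX mixing sources)."""
    return " ".join(random.choices(vocab, k=length))

def remix_stage(stage_data, vocab, generic_corpus=None, mix_ratio=1.0,
                use_random=True, seed=0):
    """Interleave one fine-tuning stage's examples with mixing data.

    mix_ratio controls how many mixing examples are added per stage
    example; use_random selects random word sequences vs. generic
    pretraining text. Both knobs are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_mix = int(len(stage_data) * mix_ratio)
    if use_random or not generic_corpus:
        mixed_in = [make_random_word_sequence(vocab) for _ in range(n_mix)]
    else:
        mixed_in = rng.sample(generic_corpus, min(n_mix, len(generic_corpus)))
    combined = list(stage_data) + mixed_in
    rng.shuffle(combined)  # interleave factoids and mixing data
    return combined

# Usage: build a mixed stage-1 training set from a toy factoid set.
vocab = ["alpha", "beta", "gamma", "delta", "epsilon"]
factoids = [f"The capital of country {i} is city {i}." for i in range(10)]
stage1 = remix_stage(factoids, vocab, mix_ratio=1.0)
```

The same call would be applied again to the stage-2 dataset, since the paper mixes at different training stages rather than only when the factoids are first learned.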
Problem

Research questions and friction points this paper is trying to address.

Preventing knowledge obsolescence in language models
Mitigating forgetting during continual memorization
Enhancing factoid retention through modified training dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continual memorization technique
REMIX data mixing strategy
Layer diversification for retention