🤖 AI Summary
This work identifies a critical issue in language model pretraining: persistent decay of “knowledge entropy” impedes acquisition of new knowledge, as model parameters increasingly concentrate knowledge into a few memorized sources, degrading knowledge assimilation capacity and exacerbating forgetting. The authors introduce, for the first time, a quantifiable notion of knowledge entropy—formalized via gradient attribution and memory-source decomposition—and validate its strong negative correlation with knowledge acquisition performance through cross-stage parameter sensitivity analysis and controlled intervention experiments. Crucially, they propose a novel mechanism to reactivate dormant memory sources, thereby significantly improving accuracy on downstream knowledge-intensive tasks (+4.2%) and enhancing long-term knowledge retention. This approach provides both a theoretically grounded framework and a practical, actionable pathway for reversing learning degradation in large language models.
📝 Abstract
In this work, we investigate how a model's tendency to broadly integrate its parametric knowledge evolves throughout pretraining, and how this behavior affects overall performance, particularly in terms of knowledge acquisition and forgetting. We introduce the concept of knowledge entropy, which quantifies the range of memory sources the model engages with; high knowledge entropy indicates that the model utilizes a wide range of memory sources, while low knowledge entropy suggests reliance on specific sources with greater certainty. Our analysis reveals a consistent decline in knowledge entropy as pretraining advances. We also find that the decline is closely associated with a reduction in the model's ability to acquire and retain knowledge, leading us to conclude that diminishing knowledge entropy (smaller number of active memory sources) impairs the model's knowledge acquisition and retention capabilities. We find further support for this by demonstrating that increasing the activity of inactive memory sources enhances the model's capacity for knowledge acquisition and retention.