🤖 AI Summary
Large language models (LLMs) rely on parameter scaling to store comprehensive world knowledge, which impedes edge deployment and makes knowledge utilization inefficient. To address this, we propose a hierarchical parametric memory-augmented architecture: long-tail knowledge is externalized into a retrievable memory module, while a compact model focuses on learning general-purpose representations and reasoning. We design a hierarchical feed-forward memory structure with dynamic retrieval that integrates seamlessly into Transformer architectures and scales in a hardware-friendly way. Through end-to-end large-scale pretraining, a 160M-parameter model augmented with an 18M-parameter memory block, fetched from a 4.6B-parameter memory bank, matches the performance of a baseline with more than twice its parameters across diverse downstream tasks. The approach generalizes robustly across multiple Transformer variants. Our core contribution is decoupling universal capabilities from long-tail knowledge via distinct yet synergistic modeling, enabling efficient joint optimization and deployment.
📝 Abstract
The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a fraction is used per prompt, and impractical for edge devices with limited inference-time memory and compute. We address this shortcoming with a memory-augmented architecture and a pretraining strategy aligned with existing hardware paradigms. We introduce small language models that access large hierarchical parametric memory banks encoding world knowledge. During pretraining and inference, we fetch a small, context-dependent memory block and add it to the model. Our pretraining learns to store long-tail world knowledge in the memory parameters, while the small language model acts as an anchor capturing common knowledge and general reasoning abilities. Through trillion-token-scale experiments, we show significant gains: a 160M-parameter model augmented with an 18M-parameter memory fetched from a 4.6B-parameter memory bank obtains comparable performance to a regular model with more than 2x the parameters. Through extensive experiments, we study the optimal type and size of parametric memories in transformers, scaling them to over 21B parameters. We find that our proposed hierarchical feed-forward memories work robustly across transformer architectures, whether added during pretraining or post-hoc.
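The mechanism described above, fetching a context-dependent block of memory parameters and adding it to a feed-forward layer, can be illustrated with a minimal toy sketch. This is not the paper's implementation: the dimensions, the use of the input projections as retrieval keys, and the flat (non-hierarchical) top-k lookup are all simplifying assumptions made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 16, 32   # toy model and feed-forward widths (illustrative only)
n_entries, block = 64, 8 # memory bank size and size of the fetched block

# Memory bank: each entry contributes one extra feed-forward "row",
# i.e. an input-projection vector and an output-projection vector.
# As a simplification, the input projections double as retrieval keys.
mem_in = rng.standard_normal((n_entries, d_model))   # input projections / keys
mem_out = rng.standard_normal((n_entries, d_model))  # output projections

def fetch_block(context: np.ndarray):
    """Select the `block` memory entries whose keys best match the context."""
    scores = mem_in @ context                 # (n_entries,)
    idx = np.argsort(scores)[-block:]         # top-k by dot-product score
    return mem_in[idx], mem_out[idx]

def ffn_with_memory(x, w_in, w_out, context):
    """Base feed-forward layer widened by the fetched memory rows."""
    m_in, m_out = fetch_block(context)
    w_in_aug = np.concatenate([w_in, m_in], axis=0)    # (d_ff + block, d_model)
    w_out_aug = np.concatenate([w_out, m_out], axis=0)
    h = np.maximum(w_in_aug @ x, 0.0)                  # ReLU activation
    return w_out_aug.T @ h                             # (d_model,)

# Base (small-model) feed-forward weights and a sample hidden state.
w_in = rng.standard_normal((d_ff, d_model))
w_out = rng.standard_normal((d_ff, d_model))
x = rng.standard_normal(d_model)

y = ffn_with_memory(x, w_in, w_out, context=x)
print(y.shape)
```

Only `block` of the `n_entries` memory rows participate in any one forward pass, so compute stays close to that of the small base model while the bank can grow much larger, which is the scalability property the abstract emphasizes.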