AI Summary
To address catastrophic forgetting in parameter-efficient fine-tuning, high inference latency in retrieval-augmented generation (RAG), and the prohibitive cost of domain-specific pretraining for large language models (LLMs), this paper proposes Memory Decoder: a plug-and-play pretrained memory module that requires no modification to the base model's parameters. Its core innovation is a portable, lightweight Transformer decoder that learns to emulate the behavior of a non-parametric retriever, coupled with a domain-adaptive training strategy that enables low-overhead, low-latency memory injection. The module is deployable across diverse LLM architectures (e.g., Qwen, Llama) and domains (e.g., biomedicine, finance, law). Extensive experiments demonstrate an average perplexity reduction of 6.17 points across multiple models and tasks, validating its efficiency, stability, and strong generalization.
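To illustrate what "memory injection" without modifying the base model might look like at inference time, here is a minimal sketch in the style of kNN-LM interpolation, which the Memory Decoder is described as emulating. The function name, the interpolation weight `lam`, and the linear-blend form are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def memory_augmented_next_token(base_logits, memdec_logits, lam=0.3):
    """Blend the base LM's next-token distribution with the Memory
    Decoder's distribution via linear interpolation (kNN-LM style).

    `lam` is an assumed hyperparameter weighting the memory distribution;
    the base model's logits are left untouched, which is what makes the
    module plug-and-play for any model sharing the same tokenizer.
    """
    p_base = torch.softmax(base_logits, dim=-1)  # base model distribution
    p_mem = torch.softmax(memdec_logits, dim=-1)  # memory module distribution
    return lam * p_mem + (1.0 - lam) * p_base  # valid probability distribution
```

Because the blend happens purely in probability space over a shared vocabulary, the same trained memory module can sit beside any base LM with the same tokenizer.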
Abstract
Large Language Models (LLMs) have shown strong abilities in general language tasks, yet adapting them to specific domains remains a challenge. Current methods such as Domain-Adaptive Pretraining (DAPT) require costly full-parameter training and suffer from catastrophic forgetting, while Retrieval-Augmented Generation (RAG) introduces substantial inference latency due to expensive nearest-neighbor searches and longer contexts. This paper introduces Memory Decoder, a plug-and-play pretrained memory that enables efficient domain adaptation without changing the original model's parameters. Memory Decoder employs a small transformer decoder that learns to imitate the behavior of an external non-parametric retriever. Once trained, it can be seamlessly integrated with any pretrained language model that shares the same tokenizer, requiring no model-specific modifications. Experimental results demonstrate that Memory Decoder effectively adapts various Qwen and Llama models to three distinct specialized domains (biomedicine, finance, and law), reducing perplexity by an average of 6.17 points. Overall, Memory Decoder introduces a novel paradigm centered on a specially pretrained memory component designed for domain-specific adaptation. This memory architecture can be integrated in a plug-and-play manner, consistently enhancing performance across multiple models within the target domain.
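The abstract says the small decoder "learns to imitate the behavior of an external non-parametric retriever." One natural way to realize such imitation is to distill the retriever's next-token distribution into the decoder with a KL-divergence loss. The sketch below assumes this distillation setup and a precomputed retrieval distribution (e.g., from a kNN search over a domain datastore); the function name and loss form are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def memory_distillation_loss(memdec_logits, retrieval_probs):
    """KL(retrieval || memory decoder): pushes the small decoder's output
    distribution toward the non-parametric retriever's distribution.

    `retrieval_probs` is assumed to be a precomputed probability
    distribution over the vocabulary (e.g., softmax over kNN neighbor
    distances); the loss is zero when the decoder matches it exactly.
    """
    log_q = F.log_softmax(memdec_logits, dim=-1)  # decoder log-probs
    # F.kl_div expects log-probs as input and probs as target.
    return F.kl_div(log_q, retrieval_probs, reduction="batchmean")
```

Training against a distribution rather than a single gold token is what lets the compact decoder internalize retrieval behavior, so the expensive nearest-neighbor search can be dropped at inference time.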