Knowledge Offloading: Decomposing LLMs into Sparse Backbones and Memory Modules

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models tightly couple general-purpose capabilities and domain-specific knowledge in a single set of parameters, which limits flexible reuse and customization. The authors propose Knowledge Offloading (KOFF), a framework that decomposes a pretrained model into a sparse shared backbone and plug-and-play domain memory modules. Starting from a frozen base model, KOFF jointly learns a structured pruning mask and lightweight recovery modules (LoRA adapters and learned key-value caches), so that domain knowledge moves into dedicated memory while general capabilities stay in the backbone. At around 12% global sparsity, KOFF preserves much of the unpruned model's performance, whereas pruning the same frozen model without memories degrades sharply. Ablations indicate that LoRA adapters and learned key-value memories are complementary. Specialization analyses suggest that language-specific neurons are preferentially pruned while language-general neurons largely remain in the backbone, which supports modular, specialized model design.

📝 Abstract

LLMs encode both general capabilities and domain-specific knowledge in a single set of parameters. We ask whether this capacity can be reorganized: keeping broadly useful computation in a shared backbone, while moving specialized knowledge into external memory modules. We propose knowledge offloading (KOFF), a framework for decomposing a pretrained LLM into a sparse shared backbone and domain-specific memories. Starting from a frozen base model, we jointly learn a structured pruning mask and lightweight recovery modules, implemented as LoRA adapters and learned key-value caches. Across Llama and Qwen models from 3B to 8B, we find that non-trivial capacity can be moved out of the shared backbone without a large loss in model ability. At around 12% global sparsity, KOFF preserves much of the unpruned model's performance, while pruning the same frozen model without memories degrades sharply. Ablations show that LoRA and learned KV memories are complementary, and specialization analyses suggest that the learned decomposition is meaningful: language-specific neurons are preferentially removed while language-general neurons largely remain in the backbone. These results suggest that knowledge can be reallocated between a shared core and swappable external memories.
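
To make the decomposition concrete, the following is a minimal sketch of one linear layer with a structured mask over its output units and a low-rank (LoRA-style) recovery path. It is not the authors' implementation: the function names, the choice to mask output units, the fixed 0/1 mask, and the toy numbers are all assumptions for illustration, and the learned key-value memory is left out. The abstract does not specify how the mask is parameterized or trained.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def masked_lora_forward(
    w: Matrix, mask: List[int], lora_a: Matrix, lora_b: Matrix, x: Vector
) -> Vector:
    """Hypothetical layer: y = mask * (W x) + B (A x).

    w      : frozen base weight, shape (d_out, d_in)
    mask   : structured 0/1 mask over output units (assumed granularity)
    lora_a : low-rank down-projection, shape (r, d_in)
    lora_b : low-rank up-projection, shape (d_out, r)
    """
    dense = matvec(w, x)
    pruned = [m * h for m, h in zip(mask, dense)]    # backbone with units removed
    recovery = matvec(lora_b, matvec(lora_a, x))     # domain-specific recovery path
    return [p + r for p, r in zip(pruned, recovery)]


def global_sparsity(masks: List[List[int]]) -> float:
    """Fraction of masked-out units across all layers."""
    total = sum(len(m) for m in masks)
    removed = sum(m.count(0) for m in masks)
    return removed / total


# Toy example: the second output unit is pruned, and the adapter
# supplies a domain-specific replacement signal on that unit.
W = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
x = [1.0, 2.0]
mask = [1, 0, 1]
A = [[0.5, 0.5]]           # rank r = 1
B = [[0.0], [2.0], [0.0]]

y = masked_lora_forward(W, mask, A, B, x)
sparsity = global_sparsity([mask, [1, 1, 1, 1, 1]])  # 1 of 8 units removed
```

In this sketch the shared backbone is the pair (W, mask), and swapping the domain amounts to swapping (A, B) while the backbone stays fixed. Setting B to zeros recovers the "pruned without memories" baseline that the abstract reports as degrading sharply.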

Problem

Research questions and friction points this paper is trying to address.

knowledge offloading

large language models

model decomposition

sparse backbone

external memory

Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge offloading

sparse backbone

memory modules