Blending Complementary Memory Systems in Hybrid Quadratic-Linear Transformers

📅 2025-05-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing key-value memory (KV-memory) suffers from quadratic computational complexity, while fast-weight memory (FW-memory) exhibits low recall accuracy—fundamental limitations hindering efficient long-range modeling. To address this, we propose the first plug-and-play hybrid memory architecture integrating key-value memory with dynamic synaptic memory. Our approach introduces three novel synergistic mechanisms: (1) soft-attention-guided initialization of FW weights, (2) dynamic weight updates driven by KV-retrieval results, and (3) a hybrid attention scheduling strategy. Together these enable both precise retrieval and efficient long-range sequence modeling. Crucially, our work is the first to systematically elucidate the principles underlying complementary memory cooperation, thereby breaking the traditional accuracy–efficiency trade-off. Extensive experiments on 340M- and 1.3B-parameter language models demonstrate significant improvements in long-context language modeling and retrieval performance. Further validation on synthetic algorithmic tasks and partially observable Markov decision processes (POMDPs) confirms the architecture's generalizability and robustness.
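Mechanism (2) above can be illustrated with a small sketch: a delta-rule write that moves the fast-weight memory's stored association toward a value retrieved by KV-attention. This is an illustrative assumption about how such an update could look in NumPy, not the paper's implementation; the names `delta_write`, `beta`, and the use of a plain delta rule are all hypothetical.

```python
# Hedged sketch of a KV-retrieval-driven fast-weight update (delta rule).
# Not the paper's code; names and the exact update rule are assumptions.
import numpy as np

d = 4
S = np.zeros((d, d))                 # fast-weight (synaptic) memory

def delta_write(S, k, v_target, beta=0.5):
    # Delta rule: nudge the association stored under key k toward v_target.
    v_old = S @ k                    # what FW-memory currently returns for k
    return S + beta * np.outer(v_target - v_old, k)

rng = np.random.default_rng(1)
k = rng.normal(size=d)
k /= np.linalg.norm(k)               # unit-norm key for stable convergence
v_kv = rng.normal(size=d)            # stand-in for a softmax-attention retrieval
for _ in range(20):                  # repeated writes drive S @ k toward v_kv
    S = delta_write(S, k, v_kv)
print(np.allclose(S @ k, v_kv, atol=1e-3))  # True
```

With a unit-norm key, each write halves the retrieval error for that key, so the FW-memory quickly absorbs the precise KV-attention result while keeping constant-size state.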

📝 Abstract
We develop hybrid memory architectures for general-purpose sequence processing neural networks that combine key-value memory using softmax attention (KV-memory) with dynamic synaptic memory through fast-weight programming (FW-memory) -- the core principles of quadratic and linear transformers, respectively. These two memory systems have complementary but individually limited properties: KV-memory offers precise retrieval but is constrained by quadratic complexity in sequence length, while FW-memory supports arbitrarily long sequences and enables more expressive computation but sacrifices precise recall. We propose and compare three methods to blend these two systems into a single memory system to leverage the strengths of both. We conduct experiments on general language modeling and retrieval tasks by training 340M- and 1.3B-parameter models from scratch, as well as on synthetic algorithmic tasks designed to precisely illustrate the benefits of certain hybrid methods over others. We also evaluate our hybrid memory systems on reinforcement learning in partially observable environments. Overall, we demonstrate how a well-designed hybrid can overcome the limitations of its individual components, offering new insights into the design principles of neural memory systems.
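The two memory systems contrasted in the abstract can be sketched side by side. The following is a minimal NumPy illustration under standard textbook formulations, not the paper's architecture: causal softmax attention stores every past (key, value) pair (quadratic cost, precise recall), while a fast-weight matrix superimposes associations via rank-1 outer-product writes (constant-size state, lossy recall). The feature map `phi` and all shapes are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation) of KV-memory vs. FW-memory.
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                          # sequence length, head dimension
Q = rng.normal(size=(T, d))          # queries
K = rng.normal(size=(T, d))          # keys
V = rng.normal(size=(T, d))          # values

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kv_memory(Q, K, V):
    """Causal softmax attention: keeps all past (k, v) pairs.
    Per-step cost grows with t (quadratic overall), recall is precise."""
    out = np.zeros_like(V)
    for t in range(T):
        scores = Q[t] @ K[: t + 1].T / np.sqrt(d)
        out[t] = softmax(scores) @ V[: t + 1]
    return out

def fw_memory(Q, K, V):
    """Fast-weight programming (linear attention): a d x d matrix S is
    updated additively, so state is O(d^2) regardless of sequence length,
    but superimposed associations make retrieval lossy."""
    phi = lambda x: np.maximum(x, 0.0) + 1e-6   # simple positive feature map
    S = np.zeros((d, d))             # fast-weight (synaptic) memory
    z = np.zeros(d)                  # running normalizer
    out = np.zeros_like(V)
    for t in range(T):
        k, q, v = phi(K[t]), phi(Q[t]), V[t]
        S += np.outer(v, k)          # write: rank-1 outer-product update
        z += k
        out[t] = (S @ q) / (z @ q)   # read: normalized retrieval
    return out

print(kv_memory(Q, K, V).shape, fw_memory(Q, K, V).shape)
```

The hybrid methods the paper proposes aim to combine the precise recall of the first routine with the constant-memory recurrence of the second.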
Problem

Research questions and friction points this paper is trying to address.

Combine KV-memory and FW-memory for better sequence processing
Address limitations of quadratic complexity and imprecise recall
Evaluate hybrid memory in language and reinforcement learning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid quadratic-linear transformer memory systems
Combine KV-memory with FW-memory
Overcome individual memory limitations