Learning to Remember, Learn, and Forget in Attention-Based Models

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the stability-plasticity dilemma that gated linear attention models face during in-context learning, which arises from their fixed memory capacity and susceptibility to interference. Framing in-context learning as a continual learning problem, the authors propose the Palimpsa framework, which uses a Bayesian metaplasticity mechanism to modulate the plasticity of each attention state according to its importance, enabling selective retention and forgetting of memories. Palimpsa unifies various gated attention architectures (revealing Mamba2 as a special case dominated by forgetting) and provides a general method for converting any non-metaplastic model into a metaplastic one. Experiments show that the approach significantly outperforms baselines on the MQAR benchmark and on commonsense reasoning tasks, effectively expanding memory capacity and improving long-sequence performance.

📝 Abstract
In-Context Learning (ICL) in transformers acts as an online associative memory and is believed to underpin their high performance on complex sequence processing tasks. However, in gated linear attention models, this memory has a fixed capacity and is prone to interference, especially for long sequences. We propose Palimpsa, a self-attention model that views ICL as a continual learning problem that must address a stability-plasticity dilemma. Palimpsa uses Bayesian metaplasticity, where the plasticity of each attention state is tied to an importance state grounded by a prior distribution that captures accumulated knowledge. We demonstrate that various gated linear attention models emerge as specific architecture choices and posterior approximations, and that Mamba2 is a special case of Palimpsa where forgetting dominates. This theoretical link enables the transformation of any non-metaplastic model into a metaplastic one, significantly expanding its memory capacity. Our experiments show that Palimpsa consistently outperforms baselines on the Multi-Query Associative Recall (MQAR) benchmark and on Commonsense Reasoning tasks.
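The paper itself does not include code; the following is a minimal NumPy sketch of the general idea the abstract describes: a gated linear attention state acting as an associative memory, where a per-entry importance state reduces plasticity for well-used entries so they resist being overwritten. All names, the gating formula, and the importance rule here are illustrative assumptions, not Palimpsa's actual equations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # key/value dimension (illustrative)

# Attention state S acts as an associative memory: a query q recalls S @ q.
S = np.zeros((d, d))
# Hypothetical per-entry importance state: high importance -> low plasticity.
importance = np.zeros((d, d))

def metaplastic_update(S, importance, k, v, base_decay=0.9, lr=1.0):
    """One assumed metaplastic step: plasticity shrinks as importance grows."""
    plasticity = 1.0 / (1.0 + importance)   # important entries resist change
    # Retain more (decay -> 1) where entries are important, forget elsewhere.
    decay = base_decay + (1.0 - base_decay) * (1.0 - plasticity)
    update = np.outer(v, k)                 # standard linear-attention write
    S = decay * S + lr * plasticity * update
    importance = importance + np.abs(update)  # accumulate evidence of use
    return S, importance

for _ in range(16):
    k = rng.standard_normal(d)
    v = rng.standard_normal(d)
    S, importance = metaplastic_update(S, importance, k, v)
```

With `importance` fixed at zero this reduces to a plain decay-gated linear attention update, loosely mirroring the paper's claim that non-metaplastic models (such as Mamba2, where forgetting dominates) arise as special cases.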
Problem

Research questions and friction points this paper is trying to address.

In-Context Learning · attention-based models · memory capacity · stability-plasticity dilemma · interference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian metaplasticity · In-Context Learning · stability-plasticity dilemma · gated linear attention · memory capacity
Djohan Bonnet
Forschungszentrum Jülich, Germany
Jamie Lohoff
Forschungszentrum Jülich, Germany; RWTH Aachen, Germany
Jan Finkbeiner
Forschungszentrum Jülich, Germany; RWTH Aachen, Germany
Elidona Skhikerujah
RWTH Aachen, Germany
Emre Neftci
Institute Director, Forschungszentrum Jülich; Professor, RWTH Aachen
Neuromorphic Engineering · Computational Neuroscience · Cognitive Systems and Behavior · Machine Learning