Learning to Forget Attention: Memory Consolidation for Adaptive Compute Reduction

πŸ“… 2026-02-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the computational redundancy of existing attention mechanisms when processing repetitive patterns and their inability to dynamically reduce attention usage during training. Inspired by human memory consolidation, we propose CRAMβ€”a dynamic routing mechanism that progressively distills episodic memory into parameterized semantic memory. CRAM achieves, for the first time, adaptive reduction of attention consumption during training, thereby surpassing the theoretical limits of static sparse attention. Integrated within a hybrid architecture combining state space models and attention, our method attains 100% retrieval accuracy on the newly introduced SRCD benchmark using only 1.6% of the original attention computation. It reduces attention usage by 48–52% on unseen tasks and achieves a 37.8Γ— computational reduction within 3K training steps, with its attention decay curve closely mirroring human memory transformation dynamics.

πŸ“ Abstract
Hybrid architectures combining state-space models with attention have achieved strong efficiency-quality tradeoffs, yet existing approaches either apply attention uniformly or learn static sparse patterns. This misses a key opportunity: *attention demand should decrease over time as recurring patterns become familiar*. We present a surprising finding from analyzing GPT-2 models: **88%** of attention operations retrieve information already predictable from the model's hidden state, and this redundancy does *not* decrease during training. Motivated by this observation, we introduce **CRAM** (**C**onsolidation-based **R**outing for **A**daptive **M**emory), a biologically inspired memory consolidation mechanism that gradually distills episodic retrievals into parametric semantic memory. Unlike prior sparse attention methods, CRAM exhibits *decreasing attention utilization* over training, achieving a **37.8×** reduction through a sharp phase transition at approximately 3K steps. We prove that this capability is *impossible* without consolidation: any static routing scheme requires Ω(f · n) attention for tasks with recurring patterns of frequency f. On our proposed SRCD benchmark, CRAM achieves **100% retrieval accuracy** at 1.6% attention compute (vs. 68% for baselines), and consolidated patterns transfer to unseen tasks with **48–52%** attention reduction without retraining. Remarkably, the learned consolidation dynamics quantitatively match human episodic-to-semantic memory transition curves from cognitive psychology (γ = 0.43 vs. γ_human ≈ 0.4–0.5). Code and benchmarks are available at [anonymized].
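The paper's implementation is not reproduced here, but the core idea in the abstract — route recurring retrievals away from attention once they have been "consolidated" into parametric memory — can be sketched in toy form. All names below (`ConsolidationRouter`, `consolidation_threshold`, the pattern-keyed stores) are illustrative assumptions, not the paper's API:

```python
class ConsolidationRouter:
    """Toy sketch of consolidation-based routing: a pattern retrieved via
    attention often enough is distilled into a 'semantic' store, after
    which future queries for it skip attention entirely."""

    def __init__(self, consolidation_threshold=3):
        self.threshold = consolidation_threshold
        self.episodic_counts = {}  # pattern -> times retrieved via attention
        self.semantic_store = {}   # consolidated pattern -> cached value

    def route(self, pattern, attention_fn):
        # Consolidated: answer from parametric (semantic) memory, zero
        # attention compute for this query.
        if pattern in self.semantic_store:
            return self.semantic_store[pattern], "semantic"
        # Not yet consolidated: pay for an episodic attention retrieval.
        value = attention_fn(pattern)
        count = self.episodic_counts.get(pattern, 0) + 1
        self.episodic_counts[pattern] = count
        if count >= self.threshold:
            # Distill into semantic memory; attention use for this
            # pattern drops to zero from here on.
            self.semantic_store[pattern] = value
        return value, "episodic"
```

In this toy model, a pattern recurring f times incurs only a constant number of attention calls (the threshold), whereas any static routing scheme keeps paying attention on every recurrence — the contrast the abstract's Ω(f · n) lower bound formalizes. The real mechanism would use a learned soft gate and approximate distillation rather than an exact cache.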
Problem

Research questions and friction points this paper is trying to address.

attention redundancy
memory consolidation
adaptive compute
sparse attention
recurring patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

memory consolidation
adaptive compute reduction
sparse attention
state-space models
episodic-to-semantic memory