🤖 AI Summary
This work addresses the computational redundancy of existing attention mechanisms when processing repetitive patterns and their inability to dynamically reduce attention usage during training. Inspired by human memory consolidation, we propose CRAM, a dynamic routing mechanism that progressively distills episodic memory into parameterized semantic memory. CRAM achieves, for the first time, adaptive reduction of attention consumption during training, thereby surpassing the theoretical limits of static sparse attention. Integrated within a hybrid architecture combining state-space models and attention, our method attains 100% retrieval accuracy on the newly introduced SRCD benchmark using only 1.6% of the original attention computation. It reduces attention usage by 48–52% on unseen tasks and achieves a 37.8× computational reduction within 3K training steps, with its attention decay curve closely mirroring human memory transformation dynamics.
📝 Abstract
Hybrid architectures combining state-space models with attention have achieved strong efficiency-quality tradeoffs, yet existing approaches either apply attention uniformly or learn static sparse patterns. This misses a key opportunity: \emph{attention demand should decrease over time as recurring patterns become familiar}. We present a surprising finding from analyzing GPT-2 models: \textbf{88\%} of attention operations retrieve information already predictable from the model's hidden state, and this redundancy does \emph{not} decrease during training. Motivated by this observation, we introduce \textbf{\ours{}} (\textbf{C}onsolidation-based \textbf{R}outing for \textbf{A}daptive \textbf{M}emory), a biologically inspired memory consolidation mechanism that gradually distills episodic retrievals into parametric semantic memory. Unlike prior sparse attention methods, \ours{} exhibits \emph{decreasing attention utilization} over training, achieving a \textbf{37.8$\times$} reduction through a sharp phase transition at approximately 3K steps. We prove that this capability is \emph{impossible} without consolidation: any static routing scheme requires $\Omega(f \cdot n)$ attention for tasks with recurring patterns of frequency $f$. On our proposed SRCD benchmark, \ours{} achieves \textbf{100\% retrieval accuracy} at 1.6\% attention compute (vs.\ 68\% for baselines), and consolidated patterns transfer to unseen tasks with \textbf{48--52\%} attention reduction without retraining. Remarkably, the learned consolidation dynamics quantitatively match human episodic-to-semantic memory transition curves from cognitive psychology ($\gamma = 0.43$ vs.\ $\gamma_{\text{human}} \approx 0.4$--$0.5$). Code and benchmarks are available at [anonymized].