🤖 AI Summary
This work addresses the computational redundancy of existing attention mechanisms when processing repetitive patterns and their inability to dynamically reduce attention usage during training. Inspired by human memory consolidation, we propose CRAM, a dynamic routing mechanism that progressively distills episodic memory into parameterized semantic memory. CRAM achieves, for the first time, adaptive reduction of attention consumption during training, thereby surpassing the theoretical limits of static sparse attention. Integrated within a hybrid architecture combining state-space models and attention, our method attains 100% retrieval accuracy on the newly introduced SRCD benchmark using only 1.6% of the original attention computation. It reduces attention usage by 48–52% on unseen tasks and achieves a 37.8× computational reduction within 3K training steps, with its attention decay curve closely mirroring human memory transformation dynamics.
📝 Abstract
Hybrid architectures combining state-space models with attention have achieved strong efficiency-quality tradeoffs, yet existing approaches either apply attention uniformly or learn static sparse patterns. This misses a key opportunity: \emph{attention demand should decrease over time as recurring patterns become familiar}. We present a surprising finding from analyzing GPT-2 models: \textbf{88\%} of attention operations retrieve information already predictable from the model's hidden state, and this redundancy does \emph{not} decrease during training. Motivated by this observation, we introduce \textbf{\ours{}} (\textbf{C}onsolidation-based \textbf{R}outing for \textbf{A}daptive \textbf{M}emory), a biologically inspired memory consolidation mechanism that gradually distills episodic retrievals into parametric semantic memory. Unlike prior sparse attention methods, \ours{} exhibits \emph{decreasing attention utilization} over training, achieving a \textbf{37.8$\times$} reduction through a sharp phase transition at approximately 3K steps. We prove that this capability is \emph{impossible} without consolidation: any static routing scheme requires $\Omega(f \cdot n)$ attention for tasks with recurring patterns of frequency $f$. On our proposed SRCD benchmark, \ours{} achieves \textbf{100\% retrieval accuracy} at 1.6\% attention compute (vs.\ 68\% for baselines), and consolidated patterns transfer to unseen tasks with \textbf{48--52\%} attention reduction without retraining. Remarkably, the learned consolidation dynamics quantitatively match human episodic-to-semantic memory transition curves from cognitive psychology ($\gamma = 0.43$ vs.\ $\gamma_{\text{human}} \approx 0.4$--$0.5$). Code and benchmarks are available at [anonymized].