Tensor Cache: Eviction-conditioned Associative Memory for Transformers

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two problems: the Transformer KV cache grows linearly with context length, and sliding-window approaches lose critical information once it falls outside the window. The authors propose Tensor Cache, which adds an outer-product fast-weight memory as an L2 cache populated exclusively by entries evicted from an L1 sliding-window attention cache, so that long-range information can be retrieved with a single matrix multiplication. A learnable gate fuses the outputs of the two cache levels. For training, a parallel weighted-sum scan that is equivalent to per-token writes replaces the chunked-mean shortcut, which introduces spurious cross-token terms. Experiments on systems scaling, associative recall, long-context language modeling, and memory capacity show that the method improves the memory–quality trade-off over existing bounded-state baselines.
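The two-level layout described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `TwoLevelCache`, the convex form of the gate (`g * L1 + (1 - g) * L2`), the choice to apply decay only when a pair is evicted, and the fixed (rather than learned) values of `lam`, `eta`, and `gate` are all assumptions made for illustration.

```python
import math
from collections import deque


class TwoLevelCache:
    """Illustrative L1 sliding window + L2 outer-product memory (hypothetical names)."""

    def __init__(self, d_k, d_v, window, lam=0.99, eta=1.0, gate=0.5):
        self.d_k, self.d_v, self.window = d_k, d_v, window
        # Fixed constants in this sketch; the source describes them as learned.
        self.lam, self.eta, self.gate = lam, eta, gate
        self.l1 = deque()                           # exact recent (k, v) pairs
        self.A = [[0.0] * d_v for _ in range(d_k)]  # L2 memory, shape d_k x d_v

    def append(self, k, v):
        """Add a pair to L1; on overflow, write the evicted pair into L2."""
        self.l1.append((k, v))
        if len(self.l1) > self.window:
            k_old, v_old = self.l1.popleft()
            # Assumed write rule: A <- lam * A + eta * (k_old outer v_old)
            self.A = [
                [self.lam * a + self.eta * ki * vj for a, vj in zip(row, v_old)]
                for row, ki in zip(self.A, k_old)
            ]

    def read_l1(self, q):
        """Softmax attention over the pairs still inside the window."""
        if not self.l1:
            return [0.0] * self.d_v
        scale = math.sqrt(self.d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k, _ in self.l1]
        top = max(scores)                           # subtract max for stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        return [
            sum(w * v[j] for w, (_, v) in zip(weights, self.l1)) / total
            for j in range(self.d_v)
        ]

    def read_l2(self, q):
        """Single vector-matrix product q A over the compressed evictions."""
        return [sum(qi * row[j] for qi, row in zip(q, self.A)) for j in range(self.d_v)]

    def read(self, q):
        """Fuse both levels with a scalar gate (assumed convex combination)."""
        g = self.gate
        return [g * a + (1.0 - g) * b for a, b in zip(self.read_l1(q), self.read_l2(q))]
```

With `window=2`, appending three pairs evicts the first one into `A`, so a query aligned with that first key can still retrieve its value through `read_l2` even though it is no longer visible to `read_l1`.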

📝 Abstract

Autoregressive Transformer KV caches grow linearly with context length; sliding-window caching bounds memory but discards evicted tokens entirely, so relevant evidence outside the window becomes inaccessible. We introduce \emph{Tensor Cache}, a two-level cache that pairs sliding-window softmax attention as a first-level cache (L1) with a fixed-size outer-product fast-weight memory as a second-level cache (L2) fed by KV pairs evicted from the window. Recent tokens remain in exact local attention; evicted pairs are compressed into a per-layer matrix $A$ and read by future queries through a single matrix multiplication, exploiting the linear-attention identity $q_t(k_i \otimes v_i)=\langle q_t,k_i\rangle v_i$. A learned scalar gate fuses the L1 and L2 outputs, and per-head decay and write-rate parameters are trained end-to-end. The outer-product memory and the read identity are well-known; our contribution is their use as an L2 cache fed exclusively by sliding-window evictions, plus identifying that the common chunked-mean training shortcut $A\!\leftarrow\!\lambda A\!+\!\eta(\bar k\!\otimes\!\bar v)$ silently introduces $C^2{-}C$ spurious cross-token outer products per chunk, and closing the gap with a parallel weighted-sum scan equivalent to per-token writes within float32 epsilon. Across systems scaling, controlled associative recall, long-context language modeling, and memory-capacity diagnostics, Tensor Cache improves the memory--quality frontier over bounded-state baselines.
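The gap between per-token writes and the chunked-mean shortcut can be checked numerically. The sketch below is one plausible reading of the abstract, not the paper's algorithm: it assumes the per-token rule is $A \leftarrow \lambda A + \eta(k_i \otimes v_i)$ applied once per token, so a chunk of $C$ tokens unrolls to $\lambda^C A + \eta \sum_i \lambda^{C-1-i}(k_i \otimes v_i)$. All function names are hypothetical, and the weighted sum is written as a plain loop over independent terms rather than as an actual parallel scan. Expanding $\bar k \otimes \bar v = \frac{1}{C^2}\sum_{i,j} k_i \otimes v_j$ shows where the $C^2 - C$ cross-token terms ($i \neq j$) come from.

```python
def outer(k, v):
    """Outer product of k and v as a len(k) x len(v) list of lists."""
    return [[ki * vj for vj in v] for ki in k]


def combine(a, X, b, Y):
    """Return a * X + b * Y elementwise for equally shaped matrices."""
    return [[a * x + b * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def per_token_write(A, keys, values, lam, eta):
    """Reference: apply A <- lam * A + eta * (k_i outer v_i) token by token."""
    for k, v in zip(keys, values):
        A = combine(lam, A, eta, outer(k, v))
    return A


def weighted_sum_write(A, keys, values, lam, eta):
    """Closed form of the loop above: decay weights lam ** (C - 1 - i) per token.

    Each term is independent of the others, so the sum could be computed in
    parallel; it is a sequential loop here only for readability.
    """
    C = len(keys)
    S = [[0.0] * len(values[0]) for _ in keys[0]]
    for i, (k, v) in enumerate(zip(keys, values)):
        S = combine(1.0, S, lam ** (C - 1 - i), outer(k, v))
    return combine(lam ** C, A, eta, S)


def chunked_mean_write(A, keys, values, lam, eta):
    """Shortcut from the abstract: A <- lam * A + eta * (mean(k) outer mean(v))."""
    C = len(keys)
    k_bar = [sum(col) / C for col in zip(*keys)]
    v_bar = [sum(col) / C for col in zip(*values)]
    return combine(lam, A, eta, outer(k_bar, v_bar))


def read(q, A):
    """Read q A, which for A = k outer v equals <q, k> * v."""
    return [sum(qi * row[j] for qi, row in zip(q, A)) for j in range(len(A[0]))]


def max_abs_diff(X, Y):
    return max(abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry))


# One chunk of C = 4 tokens with one-hot keys, so each key addresses one row of A.
keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
A0 = [[0.0, 0.0] for _ in range(4)]

exact = per_token_write(A0, keys, values, lam=0.9, eta=1.0)
weighted = weighted_sum_write(A0, keys, values, lam=0.9, eta=1.0)
shortcut = chunked_mean_write(A0, keys, values, lam=0.9, eta=1.0)

print("weighted sum vs per-token:", max_abs_diff(exact, weighted))
print("chunked mean vs per-token:", max_abs_diff(exact, shortcut))
print("read first key, per-token:", read(keys[0], exact))
print("read first key, chunked mean:", read(keys[0], shortcut))
```

In this toy chunk, reading with the first key from the per-token memory returns a decayed copy of the first value only, while the chunked-mean memory returns a blend of all four values, which is the cross-token contamination the abstract refers to.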

Problem

Research questions and friction points this paper is trying to address.

KV cache

sliding-window caching

memory eviction

long-context modeling

associative memory

Innovation

Methods, ideas, or system contributions that make the work stand out.

Tensor Cache

sliding-window caching

fast-weight memory