🤖 AI Summary
Long-sequence modeling faces a fundamental trade-off between computational efficiency, achieved by recurrent architectures with fixed-size memory, and representational fidelity, ensured by Transformer models with lossless key-value (KV) caching. This work introduces a memory framework inspired by the Multi-Store Model from cognitive science: a sliding-window KV cache serves as lossless short-term memory, while a learnable Artificial Hippocampus Network (AHN) recurrently compresses out-of-window information into a fixed-size long-term memory. AHNs are instantiated with modern RNN-like architectures, including Mamba2, DeltaNet, and Gated DeltaNet. Evaluated on the long-context benchmarks LV-Eval and InfiniteBench, AHN-augmented models consistently outperform sliding-window baselines and match or even exceed full-attention models: for Qwen2.5-3B-Instruct, inference FLOPs drop by 40.5%, KV cache size shrinks by 74.0%, and the average LV-Eval score at 128k sequence length improves from 4.41 to 5.88.
📝 Abstract
Long-sequence modeling faces a fundamental trade-off between the efficiency of compressive fixed-size memory in RNN-like models and the fidelity of lossless growing memory in attention-based Transformers. Inspired by the Multi-Store Model in cognitive science, we introduce a memory framework for artificial neural networks. Our method maintains a sliding window of the Transformer's KV cache as lossless short-term memory, while a learnable module termed Artificial Hippocampus Network (AHN) recurrently compresses out-of-window information into a fixed-size compact long-term memory. To validate this framework, we instantiate AHNs using modern RNN-like architectures, including Mamba2, DeltaNet, and Gated DeltaNet. Extensive experiments on the long-context benchmarks LV-Eval and InfiniteBench demonstrate that AHN-augmented models consistently outperform sliding-window baselines and achieve performance comparable to, or even surpassing, full-attention models, while substantially reducing computational and memory requirements. For instance, augmenting Qwen2.5-3B-Instruct with AHNs reduces inference FLOPs by 40.5% and memory cache by 74.0%, while improving its average score on LV-Eval (128k sequence length) from 4.41 to 5.88. Code is available at: https://github.com/ByteDance-Seed/AHN.
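The two-store mechanism described above can be illustrated with a toy sketch: a lossless sliding-window KV cache as short-term memory, plus a fixed-size recurrent state that absorbs evicted tokens as long-term memory. The gated outer-product recurrence and linear readout below are simplified stand-ins for the paper's Mamba2/DeltaNet-based AHN modules, not their actual learned update rules; all names and the `decay` gate are hypothetical.

```python
import numpy as np

class SlidingWindowWithAHN:
    """Toy multi-store memory: exact attention over a bounded window,
    plus a fixed-size compressed state for everything evicted from it."""

    def __init__(self, window: int, d: int):
        self.window = window
        self.kv_cache = []               # lossless short-term memory (list of (k, v))
        self.state = np.zeros((d, d))    # fixed-size long-term memory
        self.decay = 0.9                 # hypothetical scalar forget gate

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        self.kv_cache.append((k, v))
        if len(self.kv_cache) > self.window:
            old_k, old_v = self.kv_cache.pop(0)  # evict the oldest token
            # Recurrent compression: fold the evicted pair into the state,
            # so memory stays O(window + d^2) regardless of sequence length.
            self.state = self.decay * self.state + np.outer(old_v, old_k)

    def read(self, q: np.ndarray) -> np.ndarray:
        # Long-term contribution: linear-attention-style readout S @ q.
        out = self.state @ q
        # Short-term contribution: exact softmax attention over the window.
        if self.kv_cache:
            ks = np.stack([k for k, _ in self.kv_cache])
            vs = np.stack([v for _, v in self.kv_cache])
            logits = ks @ q
            w = np.exp(logits - logits.max())
            out = out + (w / w.sum()) @ vs
        return out
```

The point of the sketch is the asymptotics: the KV cache never grows past `window` entries and the compressed state never changes shape, so per-token cost stays constant however long the stream runs.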