🤖 AI Summary
This work proposes Elastic Memory, a novel architecture addressing the quadratic complexity bottleneck of Transformers in long-context processing and the trade-off between theoretical rigor and scalability in existing recurrent memory methods. By modeling historical sequences as continuous signals and leveraging the HiPPO framework, the approach enables online optimal compression into a fixed-size memory state. A reconstructible polynomial sampling mechanism is introduced to flexibly recover historical summaries at test time. The key innovation lies in decoupling theoretically optimal compression from inductive biases during inference, thereby achieving both efficiency and adaptability. Experiments demonstrate that on tasks with 32k+ context lengths, Elastic Memory reduces memory usage by 16× compared to Memorizing Transformer at equal parameter counts and outperforms Melodi, a model with 30% more parameters. Moreover, even when scaled up by 4×, it maintains superior performance with faster inference.
📄 Abstract
Transformers face a quadratic bottleneck in attention when scaling to long contexts. Recent approaches introduce recurrent memory to extend context beyond the current window, yet these often face a fundamental trade-off between theoretical principles and practical scalability. To address this, we introduce Elastic Memory, a novel memory architecture grounded in the HiPPO framework for online function approximation. Elastic Memory treats historical sequences as samples from continuous signals, applying optimal online compression to encode them into a fixed-size memory state. For retrieval, we propose a flexible *polynomial sampling* mechanism that reconstructs a history summary from this compressed state. Elastic Memory consistently outperformed baselines on long-context (32k+) datasets across three domains. With equal parameters, it outperformed Memorizing Transformer while using 16× less memory, and it beat Melodi at all memory sizes, even when Melodi had 30% more parameters. When scaling model size, Elastic Memory stayed ahead of all baselines, and at 4× size it was significantly faster than Melodi. Furthermore, its decoupled design allows inductive biases to be injected at test time to boost performance.
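To make the abstract's mechanism concrete, here is a minimal sketch of the kind of HiPPO-style pipeline it describes: a 1-D signal is compressed online into a fixed-size coefficient state via the HiPPO-LegS recurrence (Gu et al., 2020), and a history summary is recovered by evaluating the resulting Legendre expansion. This is an illustrative toy, not the paper's implementation; the function names, the choice of the LegS variant, and the scalar-signal setting are all assumptions.

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

def hippo_legs_matrices(N):
    """HiPPO-LegS transition matrices A (N x N) and B (N,) from Gu et al., 2020."""
    A = np.zeros((N, N))
    B = np.sqrt(2 * np.arange(N) + 1.0)
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = np.sqrt((2 * n + 1) * (2 * k + 1))
            elif n == k:
                A[n, k] = n + 1
    return A, B

def compress(f, N=8):
    """Online compression: one forward-Euler step of the LegS ODE per input value.

    Returns a fixed-size state c, regardless of len(f)."""
    A, B = hippo_legs_matrices(N)
    I = np.eye(N)
    c = np.zeros(N)
    for k, fk in enumerate(f, start=1):
        c = (I - A / k) @ c + (B / k) * fk
    return c

def reconstruct(c, num_points):
    """'Polynomial sampling': evaluate the compressed state's Legendre
    expansion at arbitrary points over the (rescaled) history [0, 1]."""
    xs = np.linspace(0.0, 1.0, num_points)
    out = np.zeros(num_points)
    for n, cn in enumerate(c):
        # Basis g_n(x) = sqrt(2n+1) * P_n(2x - 1) on [0, 1].
        out += cn * np.sqrt(2 * n + 1) * Legendre.basis(n)(2 * xs - 1)
    return out
```

The point of the decoupling mentioned above is that `compress` runs online with O(N) state independent of sequence length, while `reconstruct` is a free choice made at test time: one can sample the polynomial summary densely, coarsely, or non-uniformly without re-running compression.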