It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization

📅 2025-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing foundation model architectures face bottlenecks in computational efficiency and generalization, particularly in long-sequence modeling. Method: This work unifies sequence models, including Transformers, Titans, and linear RNNs, under a common associative memory module. It introduces three key innovations: (1) modeling the human cognitive phenomenon of "attentional bias" as an optimizable internal objective, going beyond the standard dot-product similarity and L2 regression objectives; (2) a retention gate mechanism that reinterprets forgetting as regularization, enabling selective retention of salient memories; and (3) Miras, a decoupled, general-purpose framework integrating memory architecture, attentional bias objective, retention gating, and learning algorithm. Contribution/Results: Based on Miras, the authors propose Moneta, Yaad, and Memora, models that outperform both Transformers and state-of-the-art linear RNNs across language modeling, commonsense reasoning, and recall-intensive tasks, while achieving superior training efficiency and strong zero-shot and out-of-distribution generalization.

📝 Abstract
Designing efficient and effective architectural backbones has been at the core of research efforts to enhance the capability of foundation models. Inspired by the human cognitive phenomenon of attentional bias (the natural tendency to prioritize certain events or stimuli), we reconceptualize neural architectures, including Transformers, Titans, and modern linear recurrent neural networks, as associative memory modules that learn a mapping of keys and values using an internal objective, referred to as attentional bias. Surprisingly, we observe that most existing sequence models leverage either (1) dot-product similarity or (2) L2 regression objectives as their attentional bias. Going beyond these objectives, we present a set of alternative attentional bias configurations along with effective approximations that stabilize their training procedure. We then reinterpret forgetting mechanisms in modern deep learning architectures as a form of retention regularization, providing a novel set of forget gates for sequence models. Building upon these insights, we present Miras, a general framework for designing deep learning architectures based on four choices: (i) associative memory architecture, (ii) attentional bias objective, (iii) retention gate, and (iv) memory learning algorithm. We present three novel sequence models, Moneta, Yaad, and Memora, that go beyond the power of existing linear RNNs while maintaining a fast, parallelizable training process. Our experiments show that different design choices in Miras yield models with varying strengths. For example, certain instances of Miras achieve exceptional performance on tasks such as language modeling, commonsense reasoning, and recall-intensive tasks, even outperforming Transformers and other modern linear recurrent models.
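The abstract's central reframing (a sequence model as an associative memory that minimizes an internal attentional bias objective online, with forgetting acting as retention regularization) can be sketched minimally. The sketch below assumes a plain linear memory matrix, the L2 regression bias, and a scalar decay gate; all names, shapes, and hyperparameters are illustrative stand-ins, not the paper's notation.

```python
import numpy as np

def memory_step(M, k, v, lr=0.5, alpha=1.0):
    """One online update of a linear associative memory M (shape d_v x d_k).

    Attentional bias objective: L2 regression 0.5 * ||M k - v||^2, which the
    paper identifies as the implicit objective of many existing sequence
    models. Retention gate: a scalar decay alpha on the old memory, a
    weight-decay-style regularizer that implements forgetting.
    Hyperparameters here are illustrative only.
    """
    pred = M @ k                   # recall the value currently stored for key k
    grad = np.outer(pred - v, k)   # gradient of 0.5*||M k - v||^2 w.r.t. M
    return alpha * M - lr * grad   # decay (retention) then gradient step

# Toy usage: write one key-value association online, then read it back.
rng = np.random.default_rng(0)
k1 = rng.normal(size=4)
k1 /= np.linalg.norm(k1)           # unit-norm key keeps the update stable
v1 = rng.normal(size=3)
M = np.zeros((3, 4))
for _ in range(50):
    M = memory_step(M, k1, v1)     # alpha=1.0: no forgetting
print(np.allclose(M @ k1, v1, atol=1e-6))  # recall matches the stored value
```

With alpha=1 the recall error contracts by a factor |1 - lr*||k||^2| per step, which is why the key is normalized; setting alpha < 1 trades recall fidelity for forgetting, the trade-off the retention gate is meant to control.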
Problem

Research questions and friction points this paper is trying to address.

Foundation model backbones face bottlenecks in computational efficiency and generalization, especially in long-sequence modeling
Most existing sequence models rely on only dot-product similarity or L2 regression as their internal (attentional bias) objective
Forgetting mechanisms in modern architectures lack a principled, unified interpretation
No general framework exists for jointly choosing memory architecture, internal objective, retention mechanism, and learning algorithm
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reconceptualize neural architectures (Transformers, Titans, linear RNNs) as associative memory modules
Introduce alternative attentional bias objectives with stable approximations, plus retention gates that recast forgetting as regularization
Propose the Miras framework and three novel sequence models: Moneta, Yaad, and Memora