🤖 AI Summary
This work addresses key challenges in long-context language modeling—attention dilution, loss of critical information, and poor generalization to novel test-time distributions—by formalizing test-time adaptation as a memory integration problem under constrained computational budgets. The authors propose an information-theoretic utility metric for context segments, coupled with a differentiable working memory module and a gated write controller, to dynamically select and integrate high-value contextual information. This approach ensures global coverage while substantially reducing gradient variance and computational overhead. Empirical results demonstrate that the method matches or exceeds state-of-the-art baselines on ZeroSCROLLS and LongBench v2 using only one-quarter of the gradient update steps, establishing a new Pareto frontier in the trade-off between efficiency and performance.
📝 Abstract
Long contexts challenge transformers: attention scores dilute across thousands of tokens, critical information is often lost in the middle, and models struggle to adapt to novel patterns at inference time. Recent work on test-time adaptation addresses this by maintaining a form of working memory -- transient parameters updated on the current context -- but existing approaches rely on uniform write policies that waste computation on low-utility regions and suffer from high gradient variance across semantically heterogeneous contexts. In this work, we reframe test-time adaptation as a budget-constrained memory consolidation problem, focusing on which parts of the context should be consolidated into working memory under limited computation. We propose Gdwm (Gated Differentiable Working Memory), a framework that introduces a write controller to gate the consolidation process. The controller estimates Contextual Utility, an information-theoretic measure of long-range contextual dependence, and allocates gradient steps accordingly while maintaining global coverage. Experiments on ZeroSCROLLS and LongBench v2 demonstrate that Gdwm achieves comparable or superior performance with 4× fewer gradient steps than uniform baselines, establishing a new efficiency-performance Pareto frontier for test-time adaptation.
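The core budget-constrained allocation the abstract describes -- more gradient steps for high-utility segments, a floor for global coverage -- can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the function name `allocate_steps`, the `floor` parameter, and the proportional largest-remainder split are all assumptions; the paper's Contextual Utility estimator is likewise stood in for by an arbitrary score vector.

```python
def allocate_steps(utilities, total_budget, floor=1):
    """Split a fixed gradient-step budget across context segments.

    utilities    -- nonnegative utility estimates, one per segment
                    (stand-in for the paper's Contextual Utility scores)
    total_budget -- total gradient steps available at test time
    floor        -- minimum steps per segment (global-coverage guarantee)
    """
    n = len(utilities)
    assert total_budget >= n * floor, "budget too small for coverage floor"

    steps = [floor] * n                       # every segment gets the floor
    remaining = total_budget - n * floor
    total_u = sum(utilities)
    if total_u > 0 and remaining > 0:
        # Distribute the remainder proportionally to utility,
        # rounding via the largest-remainder method so steps sum exactly.
        shares = [u / total_u * remaining for u in utilities]
        steps = [s + int(sh) for s, sh in zip(steps, shares)]
        leftover = total_budget - sum(steps)
        by_frac = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                         reverse=True)
        for i in by_frac[:leftover]:
            steps[i] += 1
    return steps


# Example: segment 1 dominates the utility mass, so it receives most of
# the budget, but segments 0 and 2 still get nonzero steps (coverage).
print(allocate_steps([0.1, 0.7, 0.2], total_budget=12))  # → [2, 7, 3]
```

The coverage floor mirrors the abstract's "maintaining global coverage" constraint: even a segment the controller deems low-utility is never skipped entirely, which bounds the risk of discarding information the utility estimate mis-scored.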