Locas: Your Models are Principled Initializers of Locally-Supported Parametric Memories

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of achieving efficient test-time continual learning while mitigating catastrophic forgetting and minimizing parameter overhead. The authors propose Locas, a locally supported memory module aligned with the Transformer’s feed-forward network (FFN) architecture, which can be flexibly plugged in or removed. Its key innovation lies in principled initialization of low-rank bypass FFNs using model parameters, activations, or gradients, enabling rapid convergence, strong generalization, and effective forgetting suppression, while seamlessly integrating into existing large models. Two variants are introduced: a theoretically grounded standard two-layer MLP and a GLU-FFN variant compatible with state-of-the-art large language models. Experiments demonstrate that Locas achieves effective long-context memorization on PG-19 and LoCoMo tasks with only a 0.02% parameter increase, significantly reducing the required context window while preserving original knowledge as measured by MMLU performance.

📝 Abstract
In this paper, we aim to bridge test-time training with a new type of parametric memory that can be flexibly offloaded from or merged into model parameters. We present Locas, a Locally-Supported parametric memory that shares the design of FFN blocks in modern transformers, allowing it to be flexibly permanentized into the model parameters while supporting efficient continual learning. We discuss two major variants of Locas: one with a conventional two-layer MLP design that has a clearer theoretical guarantee; the other shares the same GLU-FFN structure as SOTA LLMs, and can be easily attached to existing models for both parameter-efficient and computation-efficient continual learning. Crucially, we show that proper initialization of such low-rank sideway-FFN-style memories -- performed in a principled way by reusing model parameters, activations and/or gradients -- is essential for fast convergence, improved generalization, and catastrophic forgetting prevention. We validate the proposed memory mechanism on the PG-19 whole-book language modeling and LoCoMo long-context dialogue question answering tasks. With only 0.02% additional parameters in the lowest case, Locas-GLU is capable of storing the information from past context while maintaining a much smaller context window. In addition, we test the model's general capability loss after memorizing the whole book with Locas, through comparative MMLU evaluation. Results show the promising ability of Locas to permanentize past context into parametric knowledge with minimized catastrophic forgetting of the model's existing internal knowledge.
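To make the "merge into model parameters" idea concrete, here is a minimal numpy sketch of the standard two-layer MLP variant described in the abstract. The names (`A`, `B`, the hidden width `h`, the rank `r`) and the concatenation-based merge are illustrative assumptions, not the paper's actual implementation: because the bypass memory shares the FFN's functional form, its weights can be absorbed by widening the FFN's hidden layer, so the merged network computes the same output with no extra module at inference time.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
d, h, r = 8, 32, 2   # model dim, FFN hidden dim, low-rank memory width

# Base two-layer FFN: y = W_out @ relu(W_in @ x)
W_in = rng.standard_normal((h, d))
W_out = rng.standard_normal((d, h))

# Low-rank bypass FFN acting as the parametric memory (hypothetical
# small-scale init; the paper derives principled initializations from
# model parameters, activations, or gradients).
A = rng.standard_normal((r, d)) * 0.01
B = rng.standard_normal((d, r)) * 0.01

x = rng.standard_normal(d)
y_with_memory = W_out @ relu(W_in @ x) + B @ relu(A @ x)

# "Permanentizing" the memory: since it has the same FFN structure,
# merging amounts to concatenating along the hidden dimension.
W_in_merged = np.vstack([W_in, A])     # shape (h + r, d)
W_out_merged = np.hstack([W_out, B])   # shape (d, h + r)
y_merged = W_out_merged @ relu(W_in_merged @ x)

assert np.allclose(y_with_memory, y_merged)
```

The equivalence holds because the elementwise ReLU commutes with concatenation, so the widened FFN's output decomposes exactly into the base FFN term plus the bypass term; the 0.02% parameter overhead quoted above corresponds to keeping `r` very small relative to `h`.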
Problem

Research questions and friction points this paper is trying to address.

continual learning
catastrophic forgetting
parametric memory
context compression
test-time training
Innovation

Methods, ideas, or system contributions that make the work stand out.

parametric memory
test-time training
catastrophic forgetting
parameter-efficient learning
GLU-FFN