AI Summary
This work addresses the high memory overhead incurred by the self-attention mechanism of large language models during long-sequence inference. To mitigate this issue, the authors propose a causal synchronization mechanism that compresses historical context into model parameters while preserving the pretrained model's causal structure. By leveraging a self-supervised objective that aligns the future generation behavior of the original and updated models, the method enables efficient learning of parameterized memory. This approach avoids the pitfalls of existing constant-memory techniques, which either discard long-range dependencies or compromise causality through test-time training. Experimental results demonstrate that the proposed method significantly reduces memory consumption in both long-context and streaming inference scenarios, while consistently outperforming current parameter-based memory approaches in terms of accuracy and efficiency.
Abstract
Self-attention in Transformers has a cost that grows with sequence length, so memory consumption makes inference over long streams prohibitive. Constant-memory alternatives such as RNNs and SSMs compress history into fixed-size states and thus lose long-tail dependencies, while methods that memorize contexts into parameters, such as Test-Time Training (TTT), are prone to overfitting token-level projections and fail to preserve the causal effect of context in pretrained LLMs. We propose Absorber LLM, which formulates long-context retention as self-supervised causal synchronization: after absorbing historical contexts into parameters, a contextless model should match the original model with full context on future generations. We optimize this objective by synchronizing the internal behaviors of the updated model with those of the original one, ensuring context absorption and generalization. Experiments on long-context and streaming benchmarks show that Absorber LLM reduces inference memory and improves accuracy over prior parameter-as-memory baselines.
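The synchronization condition stated in the abstract can be written down roughly as follows. This is only an illustrative sketch: the symbols are assumptions introduced here and do not come from the paper. $\theta$ denotes the pretrained parameters, $c$ the historical context, $\Delta$ the parameter update that absorbs $c$, $x_{<t}$ the future tokens generated so far, and $D$ some divergence between next-token distributions (for example, KL). The abstract says the objective is optimized by synchronizing internal behaviors, so the actual training loss likely operates on intermediate activations rather than on this output-level form.

```latex
% Illustrative notation only (assumed, not taken from the source):
%   \theta  : pretrained parameters
%   c       : historical context to be absorbed
%   \Delta  : parameter update that stores c
%   x_{<t}  : future tokens generated before step t
%   D       : a divergence between distributions, e.g. KL
\[
\Delta^{*} \;=\; \arg\min_{\Delta}\;
\mathbb{E}_{x}\left[\, \sum_{t} D\!\left(
  p_{\theta}(\cdot \mid c,\, x_{<t})
  \;\Big\|\;
  p_{\theta+\Delta}(\cdot \mid x_{<t})
\right) \right]
\]
```

Read this way, the left-hand distribution is the original model conditioned on the full context, and the right-hand distribution is the updated model with the context removed. At the minimum the two agree on future generations, which is the matching condition the abstract describes.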