Language Models Need Sleep

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the poor scaling of attention with context length, which limits large language models on long-horizon tasks. The authors study a sleep-like consolidation mechanism that periodically converts recent context into persistent fast weights and then clears the key-value (KV) cache. During each offline "sleep" phase, the model makes $N$ recurrent passes over the accumulated context and updates the fast weights in its state-space model (SSM) blocks through a learned local rule. This shifts the extra computation to sleep while preserving the latency of wake-time prediction. On cellular automata, multi-hop graph retrieval, and a math reasoning task, a regular Transformer and SSM-attention hybrid models fail, whereas increasing the sleep duration $N$ improves the proposed models' performance, with the largest gains on examples that require deeper reasoning.

📝 Abstract

Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache. During sleep, the model performs $N$ offline recurrent passes over the accumulated context and updates the fast weights in its state-space model (SSM) blocks through a learned local rule. During inference, this shifts extra computation to sleep while preserving the latency of wake-time prediction. We test our method on controlled synthetic tasks, including cellular automata and multi-hop graph retrieval, as well as a realistic math reasoning task, on which a regular transformer as well as SSM-attention hybrid models fail. We then show that increasing sleep duration $N$ for our models improves performance, with the largest gains on examples that require deeper reasoning.
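
The sleep step described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not specify the learned local rule, so a plain delta rule stands in for it here, and every name (`sleep`, `recall_error`, `demo`, `lr`) is hypothetical. The sketch only shows the overall shape: replay the cached context for $N$ passes, write it into a fast-weight matrix, then clear the cache.

```python
# Toy sketch of sleep-like consolidation. All names are hypothetical, and the
# delta rule below is an assumed stand-in for the paper's learned local rule.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def sleep(fast_weights, kv_cache, n_passes, lr=0.5):
    """Consolidate the cached context into fast weights, then clear the cache.

    Each pass replays every cached (key, value) pair and applies the update
    W += lr * outer(value - W @ key, key) in place.
    """
    for _ in range(n_passes):
        for key, value in kv_cache:
            pred = matvec(fast_weights, key)
            err = [v - p for v, p in zip(value, pred)]
            for i, e_i in enumerate(err):
                for j, k_j in enumerate(key):
                    fast_weights[i][j] += lr * e_i * k_j
    kv_cache.clear()  # after sleep, the old context lives only in the weights
    return fast_weights


def recall_error(fast_weights, pairs):
    """Total squared error when reading the values back from the fast weights."""
    total = 0.0
    for key, value in pairs:
        pred = matvec(fast_weights, key)
        total += sum((v - p) ** 2 for v, p in zip(value, pred))
    return total


def demo(n_passes):
    """Run one sleep phase and return (recall error, remaining cache size)."""
    # Two correlated unit-norm keys, so one pass cannot store both exactly.
    pairs = [([1.0, 0.0], [1.0, 0.0]), ([0.6, 0.8], [0.0, 1.0])]
    cache = list(pairs)
    W = [[0.0, 0.0], [0.0, 0.0]]
    sleep(W, cache, n_passes)
    return recall_error(W, pairs), len(cache)


if __name__ == "__main__":
    for n in (1, 5, 20):
        print(n, demo(n))
```

In this toy setting, more passes reduce the recall error while the cache is empty after every sleep phase, which mirrors the trade the abstract describes: extra offline computation in exchange for a constant wake-time footprint. It says nothing about the magnitude of the gains reported in the paper.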

Problem

Research questions and friction points this paper is trying to address.

long-horizon tasks

attention mechanism

context length

computational scalability

reasoning depth

Innovation

Methods, ideas, or system contributions that make the work stand out.

sleep-like consolidation

fast weights

state-space model