The Pensieve Paradigm: Stateful Language Models Mastering Their Own Context

📅 2026-02-12
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Traditional large language models are constrained by fixed context windows, limiting their effectiveness on tasks that require long-document understanding, sustained dialogue, or deep research. This work proposes StateLM, a foundation model equipped with an internal reasoning loop that, for the first time, lets a language model autonomously manage its own memory state. By integrating memory tools such as context pruning, document indexing, and note-taking, StateLM dynamically restructures its context to enable stateful, controllable reasoning. Experiments show that StateLM significantly outperforms standard large language models on long-document question answering, conversational memory retention, and the BrowseComp-Plus deep-research benchmark, achieving accuracy improvements of up to 47 percentage points.

Technology Category

Application Category

๐Ÿ“ Abstract
In the world of Harry Potter, when Dumbledore's mind is overburdened, he extracts memories into a Pensieve to be revisited later. In the world of AI, while we possess the Pensieve (mature databases and retrieval systems), our models inexplicably lack the "wand" to operate it. They remain like a Dumbledore without agency, passively accepting a manually engineered context as their entire memory. This work finally places the wand in the model's hand. We introduce StateLM, a new class of foundation models endowed with an internal reasoning loop to manage their own state. We equip our model with a suite of memory tools, such as context pruning, document indexing, and note-taking, and train it to actively manage these tools. By learning to dynamically engineer its own context, our model breaks free from the architectural prison of a fixed window. Experiments across various model sizes demonstrate StateLM's effectiveness across diverse scenarios. On long-document QA tasks, StateLMs consistently outperform standard LLMs across all model scales; on the chat memory task, they achieve absolute accuracy improvements of 10% to 20% over standard LLMs. On the deep research task BrowseComp-Plus, the performance gap becomes even more pronounced: StateLM achieves up to 52% accuracy, whereas standard LLM counterparts struggle around 5%. Ultimately, our approach shifts LLMs from passive predictors to state-aware agents, where reasoning becomes a stateful and manageable process.
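To make the memory-tool idea concrete, here is a minimal, purely illustrative sketch of the kind of state a model could manage: a bounded active context plus tools for pruning, indexing evicted chunks, and note-taking. The class and method names (`MemoryState`, `prune_context`, `take_note`, `fetch`) are our own assumptions for illustration, not the paper's actual API or training setup.

```python
# Hypothetical sketch of a StateLM-style memory state: a bounded active
# context window, an index holding evicted chunks, and a note pad.
# All names here are illustrative assumptions, not the paper's interface.

class MemoryState:
    def __init__(self, window_limit=8):
        self.window_limit = window_limit
        self.context = []   # active window: list of (chunk_id, text)
        self.index = {}     # chunk_id -> text, evicted but retrievable
        self.notes = []     # compressed takeaways kept alongside the window

    def add(self, chunk_id, text):
        """Append a chunk to the active window, pruning if over budget."""
        self.context.append((chunk_id, text))
        while len(self.context) > self.window_limit:
            self.prune_context()

    def prune_context(self):
        """Evict the oldest chunk, but index it so it can be re-fetched."""
        chunk_id, text = self.context.pop(0)
        self.index[chunk_id] = text

    def take_note(self, note):
        """Record a short summary that survives pruning."""
        self.notes.append(note)

    def fetch(self, chunk_id):
        """Restore a previously evicted chunk into the active window."""
        if chunk_id in self.index:
            self.add(chunk_id, self.index.pop(chunk_id))
```

In an actual agent loop, the model would emit tool calls deciding when to prune, note, or fetch; this sketch only captures the bookkeeping those calls would act on.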
Problem

Research questions and friction points this paper is trying to address.

stateful language models
context management
memory tools
long-context reasoning
agentive LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stateful Language Models
Context Management
Memory Tools
Dynamic Context Engineering
Reasoning Loop
🔎 Similar Papers
No similar papers found.