Dynamic Long Context Reasoning over Compressed Memory via End-to-End Reinforcement Learning

📅 2026-02-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenges large language models face in processing extremely long contexts, namely high computational cost, information forgetting, and context fragmentation, by introducing a cognitively inspired compressed-memory architecture. The proposed framework integrates chunk-wise compression, gated memory selection, and an evolving working memory, jointly optimized via end-to-end reinforcement learning so that relevant memories are retrieved dynamically and the model extrapolates over ultra-long contexts. Moving beyond the limitations of conventional retrieval-augmented generation (RAG), the method achieves competitive accuracy on multi-hop reasoning benchmarks such as RULER-HQA, extends the effective context length from 7K to 1.75M tokens, and, relative to the MemAgent baseline, halves peak GPU memory usage and accelerates inference by sixfold.

📝 Abstract
Large Language Models (LLMs) face significant challenges in long-context processing, including quadratic computational costs, information forgetting, and the context fragmentation inherent in retrieval-augmented generation (RAG). We propose a cognitively inspired framework for efficient long-context inference based on chunk-wise compression and selective memory recall, rather than processing all raw tokens. The framework segments long inputs into chunks and encodes each chunk into compressed memory representations using a learned compressor. A gating module dynamically selects relevant memory blocks, which are then iteratively processed by a reasoning module with an evolving working memory to solve downstream tasks. The compressor and reasoner are jointly optimized via end-to-end reinforcement learning, while the gating module is trained separately as a classifier. Experimental results show that the proposed method achieves competitive accuracy on multi-hop reasoning benchmarks such as RULER-HQA, extrapolates context length from 7K to 1.75M tokens, and offers a favorable accuracy-efficiency trade-off compared to strong long-context baselines. In particular, it achieves up to a 2 times reduction in peak GPU memory usage and a 6 times inference speedup over MemAgent.
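The pipeline the abstract describes (segment into chunks, compress each chunk into a memory representation, gate-select relevant memories, then fold them into an evolving working memory) can be sketched as follows. This is a minimal illustrative sketch with made-up linear/tanh modules and dimensions; none of the names, shapes, or thresholds come from the paper.

```python
# Hypothetical sketch of the described pipeline: chunk the input,
# compress each chunk into a small memory vector, gate-select relevant
# memories, then iteratively update an evolving working memory.
# All modules here are illustrative stand-ins, not the paper's design.
import numpy as np

rng = np.random.default_rng(0)
D_TOK, D_MEM = 64, 16                          # token dim, memory dim
W_comp = rng.normal(0, 0.1, (D_TOK, D_MEM))    # stand-in compressor
W_gate = rng.normal(0, 0.1, D_MEM)             # stand-in gate classifier
W_reason = rng.normal(0, 0.1, (2 * D_MEM, D_MEM))  # working-memory update

def compress(chunk):
    """Encode a (chunk_len, D_TOK) chunk into one D_MEM memory vector."""
    return np.tanh(chunk.mean(axis=0) @ W_comp)

def gate(mem):
    """Relevance score in (0, 1); memories above 0.5 are recalled."""
    return 1.0 / (1.0 + np.exp(-(mem @ W_gate)))

def reason(selected, wm):
    """Fold each selected memory block into the working memory."""
    for m in selected:
        wm = np.tanh(np.concatenate([wm, m]) @ W_reason)
    return wm

tokens = rng.normal(size=(1000, D_TOK))        # a long "context"
chunks = np.split(tokens, 10)                  # chunk-wise segmentation
memories = [compress(c) for c in chunks]       # compressed memory bank
selected = [m for m in memories if gate(m) > 0.5]
working_memory = reason(selected, np.zeros(D_MEM))
print(working_memory.shape)                    # fixed-size state: (16,)
```

Note how the working memory stays a fixed size regardless of input length, which is what lets such a scheme extrapolate far beyond the training context while keeping memory use flat.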
Problem

Research questions and friction points this paper is trying to address.

long-context reasoning
information forgetting
context fragmentation
computational cost
retrieval-augmented generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

compressed memory
dynamic gating
end-to-end reinforcement learning
long-context reasoning
chunk-wise compression
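The abstract states that the compressor and reasoner are jointly optimized via end-to-end reinforcement learning with a task-level reward. A generic REINFORCE-style update under that setup might look like the sketch below; the single parameter vector, Bernoulli "answer", and learning rate are all illustrative assumptions, not the paper's training recipe.

```python
# Hypothetical REINFORCE-style joint update: one parameter vector
# stands in for the compressor + reasoner, and the reward is binary
# task success. Illustrative only; not the paper's implementation.
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(0, 0.1, 4)   # joint compressor + reasoner params

def rollout(theta):
    """Sample an 'answer'; return (score-function grad, reward)."""
    p = 1.0 / (1.0 + np.exp(-theta.sum()))     # P(correct answer)
    action = rng.random() < p                  # sampled answer
    reward = 1.0 if action else 0.0            # task-level reward
    grad = (action - p) * np.ones_like(theta)  # d log pi / d theta
    return grad, reward

lr, baseline = 0.5, 0.0
p_before = 1.0 / (1.0 + np.exp(-theta.sum()))
for _ in range(200):
    grad, reward = rollout(theta)
    baseline = 0.9 * baseline + 0.1 * reward   # moving-average baseline
    theta += lr * grad * (reward - baseline)   # REINFORCE update
p_after = 1.0 / (1.0 + np.exp(-theta.sum()))
print(p_before < p_after)                      # success prob. improved
```

The key property this illustrates is end-to-end credit assignment: the reward depends only on the final answer, yet the gradient flows back through the sampled behavior to all jointly trained parameters, with no per-chunk supervision. The gating module, per the abstract, is trained separately as a classifier and would sit outside this loop.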