🤖 AI Summary
Large language models (LLMs) face significant challenges in long-context processing, including prohibitive computational overhead, excessive memory consumption, and limitations of traditional RAG, namely its reliance on explicit queries and structured knowledge. To address these issues, we propose a dual-system RAG framework: a lightweight long-range system that constructs a global memory, generates cue-rich draft answers, and retrieves relevant passages efficiently; and a heavyweight expressive system that synthesizes high-quality final answers from the retrieved content. Our key contributions include (1) a global-memory-augmented RAG paradigm that eliminates the dependence on explicit queries and structured knowledge, and (2) a KV-compression-based memory module whose memorization and cluing capacity are optimized via reinforcement learning from generation feedback (RLGF). Extensive evaluation across diverse long-context benchmarks demonstrates substantial improvements over state-of-the-art baselines, particularly in complex scenarios where conventional RAG fails, achieving a superior trade-off between efficiency and effectiveness.
📝 Abstract
Processing long contexts presents a significant challenge for large language models (LLMs). While recent advancements allow LLMs to handle much longer contexts than before (e.g., 32K or 128K tokens), doing so is computationally expensive and can still be insufficient for many applications. Retrieval-Augmented Generation (RAG) is considered a promising strategy to address this problem. However, conventional RAG methods face inherent limitations because of two underlying requirements: 1) explicitly stated queries, and 2) well-structured knowledge. These conditions do not hold in general long-context processing tasks. In this work, we propose MemoRAG, a novel RAG framework empowered by global memory-augmented retrieval. MemoRAG features a dual-system architecture. First, it employs a light but long-range system to create a global memory of the long context. Once a task is presented, this system generates draft answers, providing useful clues that guide the retrieval tools toward relevant information within the long context. Second, it leverages an expensive but expressive system, which generates the final answer based on the retrieved information. Building upon this fundamental framework, we realize the memory module in the form of KV compression, and reinforce its memorization and cluing capacity through reinforcement learning from generation feedback (RLGF). In our experiments, MemoRAG achieves superior performance across a variety of long-context evaluation tasks, not only in complex scenarios where traditional RAG methods struggle, but also in simpler ones where RAG is typically applied.
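The dual-system flow described in the abstract can be sketched as a minimal pipeline: a light memory stage compresses the long context and drafts clue terms, a retriever uses those clues to locate passages, and a heavy stage produces the final answer. The sketch below is purely illustrative: every function name is hypothetical, the "memory" is stand-in chunking rather than real KV compression, and the clue/answer generators are toy stand-ins for the light and heavy LLMs.

```python
# Illustrative sketch of a MemoRAG-style dual-system pipeline.
# All names (build_global_memory, draft_clues, etc.) are hypothetical,
# not the paper's API; real systems replace each stub with an LLM call.

def build_global_memory(long_context: str, chunk_size: int = 200) -> list[str]:
    """Light system, stage 1: form a global memory of the long context.
    Stand-in: plain chunking (the paper compresses KV states instead)."""
    return [long_context[i:i + chunk_size]
            for i in range(0, len(long_context), chunk_size)]

def draft_clues(memory: list[str], query: str) -> list[str]:
    """Light system, stage 2: draft a cue-rich answer that yields clues.
    Stand-in: keyword extraction from the query."""
    return [w.strip("?,.").lower() for w in query.split() if len(w) > 3]

def retrieve(memory: list[str], clues: list[str], top_k: int = 2) -> list[str]:
    """Use the clues to locate the most relevant passages in memory."""
    scored = sorted(memory,
                    key=lambda c: -sum(c.lower().count(cl) for cl in clues))
    return scored[:top_k]

def generate_final_answer(passages: list[str], query: str) -> str:
    """Heavy system: synthesize the final answer from retrieved evidence.
    Stand-in: concatenation (a real system calls an expressive LLM)."""
    return f"Answer to '{query}' based on: " + " | ".join(passages)

# Toy usage: the clue-guided retriever surfaces the Amazon passages.
context = ("The Nile is the longest river in Africa. " * 3
           + "The Amazon carries the most water of any river. " * 3)
query = "Which river carries the most water?"
memory = build_global_memory(context)
clues = draft_clues(memory, query)
evidence = retrieve(memory, clues)
print(generate_final_answer(evidence, query))
```

The key design point the sketch mirrors is that retrieval is driven by clues drafted from the global memory rather than by the raw query alone, which is what lets the framework handle tasks without an explicitly stated, well-matched query.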