🤖 AI Summary
To address human preference bias, benchmark overfitting, and the poor distributional adaptability of generic compression algorithms in long-term memory for large language models (LLMs), this paper proposes SUMER, an end-to-end reinforcement learning framework based on experience replay. SUMER abandons hand-crafted memory-compression paradigms and, for the first time, performs goal-directed autonomous retrieval and reasoning directly over raw, uncompressed memories. Built on Qwen2.5-7B-Instruct and trained with reinforcement learning with verifiable rewards (RLVR), it learns to invoke search tools to dynamically aggregate the information needed to answer a target question. Its core innovation is this goal-driven memory-access mechanism, which argues for moving long-term memory evaluation toward dynamic, scalable benchmarks. On the LoCoMo dataset, SUMER outperforms all compared compression methods and the full-context baseline, achieving a 43% gain over the prior best and establishing new state-of-the-art (SOTA) results.
📝 Abstract
How to enable human-like long-term memory in large language models (LLMs) has been a central question for unlocking more general capabilities such as few-shot generalization. Existing memory frameworks and benchmarks focus on finding the optimal memory compression algorithm for higher performance in tasks that require recollection and sometimes further reasoning. However, such efforts have ended up building more human bias into the compression algorithm, through the search for the best prompts and memory architectures that suit specific benchmarks, rather than finding a general solution that would work on other data distributions. On the other hand, goal-directed search on uncompressed information could potentially exhibit superior performance because compression is lossy, and a predefined compression algorithm will not fit all raw data distributions. Here we present SUMER (Search in Uncompressed Memory via Experience Replay), an end-to-end reinforcement learning agent with verifiable reward (RLVR) that learns to use search tools to gather information and answer a target question. On the LoCoMo dataset for long-context conversation understanding, SUMER with Qwen2.5-7B-Instruct learned to use search tools and outperformed all other biased memory compression approaches and also the full-context baseline, reaching SOTA performance (43% gain over the prior best). We demonstrate that a simple search method applied to raw data outperforms goal-agnostic and biased compression algorithms in current long-context memory tasks, arguing for new paradigms and benchmarks that are more dynamic and autonomously scalable. Code for SUMER and all implemented baselines is publicly available at https://github.com/zycyc/SUMER.
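To make the goal-directed search loop concrete, here is a minimal, self-contained sketch of the paradigm the abstract describes: an agent that calls a search tool over raw, uncompressed memory and receives a verifiable terminal reward. All names (`RawMemoryStore`, `stub_policy`, the keyword-overlap ranking) are illustrative assumptions, not the SUMER implementation; in the real system the policy is the trained Qwen2.5-7B-Instruct model and the reward drives RL updates.

```python
import re
from dataclasses import dataclass, field

def tokens(text: str) -> set:
    """Lowercased word tokens, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

@dataclass
class RawMemoryStore:
    """Uncompressed memory: every conversation turn is kept verbatim (no summaries)."""
    turns: list = field(default_factory=list)

    def search(self, query: str, k: int = 2) -> list:
        # Retrieval tool the agent invokes: rank raw turns by token overlap with the query.
        return sorted(
            self.turns,
            key=lambda t: len(tokens(t) & tokens(query)),
            reverse=True,
        )[:k]

def verifiable_reward(answer: str, gold: str) -> float:
    """RLVR-style terminal reward: 1.0 iff the answer matches the reference."""
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0

def run_episode(memory: RawMemoryStore, question: str, gold: str, policy) -> float:
    """One rollout: the policy issues a search call, reads raw evidence, answers."""
    evidence = memory.search(question)      # tool invocation over raw memory
    answer = policy(question, evidence)     # LLM policy; stubbed below for illustration
    return verifiable_reward(answer, gold)  # scalar reward used for the RL update

if __name__ == "__main__":
    store = RawMemoryStore(turns=[
        "Alice: I adopted a cat named Mochi last spring.",
        "Bob: My favorite food is ramen.",
    ])

    # Stub policy standing in for the trained model: grab the word
    # after "named" in the top-ranked evidence turn.
    def stub_policy(question, evidence):
        m = re.search(r"named (\w+)", evidence[0])
        return m.group(1) if m else ""

    print(run_episode(store, "What is the name of Alice's cat?", "Mochi", stub_policy))
```

The point of the sketch is the division of labor: the memory stays raw and lossless, while all goal-dependent selection happens at query time through the search tool, and the only training signal is the verifiable end-of-episode reward.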