🤖 AI Summary
This work addresses the performance degradation of large language model (LLM) agents during prolonged interactions, often caused by unstable memory systems or accumulated latency. The authors propose LightMem, a novel framework that introduces a small language model to construct a lightweight, modular memory system partitioned into short-, medium-, and long-term components, with decoupled online retrieval and offline consolidation pipelines. Key innovations include a two-stage online retrieval mechanism combining vector-based coarse search with semantic re-ranking, user-aware memory isolation via user identifiers, and an incremental strategy for long-term memory integration. Evaluated on the LoCoMo benchmark, LightMem achieves an average F1 score improvement of 2.5, with a median retrieval latency of only 83 ms and end-to-end latency of 581 ms, effectively balancing efficacy and efficiency.
📝 Abstract
Although LLM agents can leverage tools for complex tasks, they still need memory to maintain cross-turn consistency and accumulate reusable information in long-horizon interactions. Retrieval-based external memory systems incur low online overhead but suffer from unstable accuracy due to limited query construction and candidate filtering. In contrast, many systems rely on repeated large-model calls for online memory operations, which improves accuracy but accumulates latency over long interactions. We propose LightMem, a lightweight, Small Language Model (SLM)-driven memory system for agents. LightMem modularizes memory retrieval, writing, and long-term consolidation, and separates online processing from offline consolidation to enable efficient memory invocation under bounded compute. We organize memory into short-term memory (STM) for immediate conversational context, mid-term memory (MTM) for reusable interaction summaries, and long-term memory (LTM) for consolidated knowledge, and use user identifiers to support independent retrieval and incremental maintenance in multi-user settings. Online, LightMem operates under a fixed retrieval budget and selects memories via a two-stage procedure: vector-based coarse retrieval followed by semantic consistency re-ranking. Offline, it abstracts reusable interaction evidence and incrementally integrates it into LTM. Experiments show gains across model scales: an average F1 improvement of about 2.5 on LoCoMo, together with low median latency (83 ms retrieval; 581 ms end-to-end).
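The user-isolated store and two-stage online retrieval described above can be sketched as follows. This is a minimal illustration under assumed names, not the paper's implementation: the toy hash-based `embed` stands in for a real encoder, and the token-overlap `consistency` score stands in for the SLM-based semantic consistency re-ranker.

```python
import numpy as np

def embed(text: str, dim: int = 32) -> np.ndarray:
    """Toy deterministic embedding (placeholder for a real encoder)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

class MemoryStore:
    """Per-user memory entries, as in the multi-user isolation the
    abstract describes; structure and names are assumptions."""

    def __init__(self):
        self.entries: dict[str, list[tuple[str, np.ndarray]]] = {}

    def write(self, user_id: str, text: str) -> None:
        self.entries.setdefault(user_id, []).append((text, embed(text)))

    def retrieve(self, user_id: str, query: str,
                 coarse_k: int = 8, budget: int = 3) -> list[str]:
        """Stage 1: vector-based coarse search; Stage 2: semantic
        consistency re-ranking, truncated to a fixed budget."""
        items = self.entries.get(user_id, [])
        if not items:
            return []
        q = embed(query)
        # Stage 1: cosine similarity over unit vectors, keep top coarse_k.
        sims = [(float(q @ vec), text) for text, vec in items]
        coarse = sorted(sims, reverse=True)[:coarse_k]
        # Stage 2: re-rank candidates by a semantic-consistency score
        # (token overlap here; an SLM scorer in the real system).
        q_tokens = set(query.lower().split())
        def consistency(text: str) -> float:
            t = set(text.lower().split())
            return len(q_tokens & t) / max(len(q_tokens | t), 1)
        reranked = sorted(coarse, key=lambda st: consistency(st[1]),
                          reverse=True)
        return [text for _, text in reranked[:budget]]

store = MemoryStore()
store.write("alice", "alice likes green tea in the morning")
store.write("alice", "project deadline moved to friday")
store.write("bob", "bob prefers coffee")
hits = store.retrieve("alice", "what tea does alice like")
print(hits)
```

Keeping the coarse stage cheap (pure vector math) and bounding both `coarse_k` and `budget` mirrors the fixed-budget online path; the heavier consolidation into LTM would run offline, outside this loop.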