🤖 AI Summary
This study addresses the memory mechanisms required for large language models (LLMs) to achieve genuine conversational continuity and experience-based learning in long-context question answering. Using LoCoMo, a synthetic benchmark of long-context dialogues annotated for question answering, the authors systematically assess five memory paradigms: full-context prompting, retrieval-augmented generation (RAG), in-context learning, agentic memory, and procedural memory. Experimental results reveal that model scale critically influences which memory architecture is suitable: smaller models benefit most from lightweight RAG, whereas stronger reasoning models gain more from structured episodic memory, which also helps them delineate the boundaries of their own knowledge. Moreover, optimal memory configurations reduce token consumption by over 90% while preserving competitive answer accuracy. The core contributions are (1) a reproducible, fine-grained evaluation framework for long-context memory, and (2) empirical evidence of the tight coupling between memory requirements and intrinsic model capabilities.
📝 Abstract
In order for large language models to achieve true conversational continuity and to benefit from experiential learning, they need memory. While research has focused on the development of complex memory systems, it remains unclear which types of memory are most effective for long-context conversational tasks. We present a systematic evaluation of memory-augmented methods using LoCoMo, a benchmark of synthetic long-context dialogues annotated for question-answering tasks that require diverse reasoning strategies. We analyse full-context prompting; semantic memory through retrieval-augmented generation and agentic memory; episodic memory through in-context learning; and procedural memory through prompt optimisation. Our findings show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy. Memory architecture complexity should scale with model capability: small foundation models benefit most from RAG, while strong instruction-tuned reasoning models gain from episodic learning through reflections and from more complex agentic semantic memory. In particular, episodic memory can help LLMs recognise the limits of their own knowledge.
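To make the semantic-memory setting concrete, the following is a minimal sketch (not the paper's implementation) of retrieval-augmented generation over a long dialogue: instead of prompting the model with the full conversation history, only the few turns most similar to the question are retrieved and placed in the prompt, which is where the large token savings come from. The bag-of-words similarity used here is a stand-in assumption; a real system would use a neural embedding model, but the retrieve-then-prompt flow is the same.

```python
# Illustrative RAG-style semantic memory over a dialogue history.
# The embedding is a toy bag-of-words vector (an assumption for
# self-containment); swap in a neural embedder in practice.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' of a dialogue turn."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(history: list[str], question: str, k: int = 2) -> list[str]:
    """Return the k dialogue turns most similar to the question.
    Prompting with only these turns, rather than the whole history,
    is what reduces token usage relative to full-context prompting."""
    q = embed(question)
    ranked = sorted(history, key=lambda turn: cosine(embed(turn), q),
                    reverse=True)
    return ranked[:k]

# Hypothetical dialogue turns for illustration.
history = [
    "Alice: I adopted a beagle named Max last spring.",
    "Bob: My sister moved to Lisbon for a new job.",
    "Alice: Max loves chasing squirrels in the park.",
]
context = retrieve(history, "What is the name of Alice's dog?")
```

The retrieved `context` (here, the turns about Alice's dog) would then be prepended to the question in the model's prompt, in place of the full conversation.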