🤖 AI Summary
This study addresses the memory mechanisms required for large language models (LLMs) to achieve genuine conversational continuity and experience-based learning in long-context question answering. Using LoCoMo, a synthetic benchmark of long-context dialogues annotated for question answering, the authors systematically assess five memory paradigms: full-context prompting, retrieval-augmented generation (RAG), in-context learning, agentic memory, and procedural memory. Experimental results reveal that model scale critically influences which memory architecture is suitable: smaller models benefit most from lightweight RAG, whereas stronger reasoning models gain more from structured episodic memory, which also helps them delineate the boundaries of their own knowledge. Moreover, optimal memory configurations reduce token consumption by over 90% while preserving competitive answer accuracy. The core contributions are (1) a reproducible, fine-grained evaluation framework for long-context memory, and (2) empirical evidence of the tight coupling between memory requirements and intrinsic model capabilities.
📝 Abstract
In order for large language models to achieve true conversational continuity and to benefit from experiential learning, they need memory. While research has focused on the development of complex memory systems, it remains unclear which types of memory are most effective for long-context conversational tasks. We present a systematic evaluation of memory-augmented methods using LoCoMo, a benchmark of synthetic long-context dialogues annotated for question-answering tasks that require diverse reasoning strategies. We analyse full-context prompting; semantic memory through retrieval-augmented generation and agentic memory; episodic memory through in-context learning; and procedural memory through prompt optimisation. Our findings show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy. Memory architecture complexity should scale with model capability: small foundation models benefit most from RAG, while strong instruction-tuned reasoning models gain from episodic learning through reflections and from more complex agentic semantic memory. In particular, episodic memory can help LLMs recognise the limits of their own knowledge.
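To make the semantic-memory setting concrete, the following is a minimal sketch (not the paper's implementation) of retrieval-augmented generation over a long dialogue: instead of prompting the model with the full conversation history, only the few turns most similar to the question are retrieved and placed in the prompt, which is where the large token savings come from. The bag-of-words similarity used here is a stand-in assumption; a real system would use a neural embedding model, but the retrieve-then-prompt flow is the same.

```python
# Illustrative RAG-style semantic memory over a dialogue history.
# The embedding is a toy bag-of-words vector (an assumption for
# self-containment); swap in a neural embedder in practice.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' of a dialogue turn."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(history: list[str], question: str, k: int = 2) -> list[str]:
    """Return the k dialogue turns most similar to the question.
    Prompting with only these turns, rather than the whole history,
    is what reduces token usage relative to full-context prompting."""
    q = embed(question)
    ranked = sorted(history, key=lambda turn: cosine(embed(turn), q),
                    reverse=True)
    return ranked[:k]

# Hypothetical dialogue turns for illustration.
history = [
    "Alice: I adopted a beagle named Max last spring.",
    "Bob: My sister moved to Lisbon for a new job.",
    "Alice: Max loves chasing squirrels in the park.",
]
context = retrieve(history, "What is the name of Alice's dog?")
```

The retrieved `context` (here, the turns about Alice's dog) would then be prepended to the question in the model's prompt, in place of the full conversation.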