EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the lack of systematic evaluation of diverse memory capabilities in large language models (LLMs) within multi-turn conversational settings. Drawing on cognitive psychology, the authors introduce the first benchmark specifically designed to assess both declarative and non-declarative memory in multi-session dialogues, decomposing memory performance into fine-grained, measurable dimensions. To support the benchmark, they propose a hybrid data synthesis framework that combines topic-guided generation with narrative-inspired heuristics, enabling controlled generation of dialogue data of varying complexity, along with sample-level evaluation guidelines. Experiments reveal significant performance disparities across memory dimensions in current models and show that prevailing memory mechanisms do not consistently improve performance; gains are often offset by computational inefficiencies and scalability bottlenecks.
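To make the "fine-grained, measurable dimensions" concrete: declarative memory is classically split into episodic and semantic memory, while non-declarative memory covers abilities such as procedural memory. A minimal sketch of such a taxonomy is below; the declarative/non-declarative split comes from the paper, but the specific sub-dimensions and all identifiers follow the standard cognitive-psychology textbook decomposition and are assumptions, not EvolMem's exact category names.

```python
# Hypothetical sketch of a cognitive-psychology memory taxonomy for a
# benchmark like EvolMem. The top-level split is stated in the paper;
# the sub-dimensions below are standard textbook categories and are
# assumptions, not the paper's exact labels.
MEMORY_TAXONOMY = {
    "declarative": {
        "episodic": "recall of specific events from earlier sessions",
        "semantic": "facts and knowledge accumulated across sessions",
    },
    "non_declarative": {
        "procedural": "skills and habits exhibited without explicit recall",
    },
}

def dimensions(taxonomy: dict) -> list[str]:
    """Flatten the taxonomy into 'branch/ability' dimension labels."""
    return [
        f"{branch}/{ability}"
        for branch, abilities in taxonomy.items()
        for ability in abilities
    ]

print(dimensions(MEMORY_TAXONOMY))
# ['declarative/episodic', 'declarative/semantic', 'non_declarative/procedural']
```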


๐Ÿ“ Abstract
Despite recent advances in understanding and leveraging long-range conversational memory, existing benchmarks still lack systematic evaluation of large language models (LLMs) across diverse memory dimensions, particularly in multi-session settings. In this work, we propose EvolMem, a new benchmark for assessing the multi-session memory capabilities of LLMs and agent systems. EvolMem is grounded in cognitive psychology and encompasses both declarative and non-declarative memory, further decomposed into multiple fine-grained abilities. To construct the benchmark, we introduce a hybrid data synthesis framework that consists of topic-initiated generation and narrative-inspired transformations. This framework enables scalable generation of multi-session conversations with controllable complexity, accompanied by sample-specific evaluation guidelines. Extensive evaluation reveals that no LLM consistently outperforms others across all memory dimensions. Moreover, agent memory mechanisms do not necessarily enhance LLMs' capabilities and often exhibit notable efficiency limitations. Data and code will be released at https://github.com/shenye7436/EvolMem.
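The abstract names two synthesis stages: topic-initiated generation and narrative-inspired transformations, with a sample-specific evaluation guideline attached to each sample. A minimal sketch of how such a pipeline could be wired is below; `generate_session`, `transform_narrative`, the `complexity` knob, and the placeholder bodies are hypothetical, since the paper's actual interfaces are not shown on this page (in practice the generation step would call an LLM).

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    """One multi-session benchmark sample plus its own evaluation rubric."""
    sessions: list = field(default_factory=list)  # list of sessions, each a list of turns
    guideline: str = ""  # sample-specific evaluation guideline

def generate_session(topic: str, complexity: int) -> list:
    """Topic-initiated generation: draft one session's turns for a topic.
    Placeholder stand-in for an LLM call."""
    return [f"[{topic}] turn {i} (complexity={complexity})" for i in range(2 + complexity)]

def transform_narrative(sessions: list) -> list:
    """Narrative-inspired transformation: restructure sessions so facts
    evolve across the conversation. A trivial reversal as a stand-in."""
    return list(reversed(sessions))

def synthesize(topics: list, complexity: int) -> Sample:
    """Hybrid pipeline: topic-initiated drafts, then a narrative
    transformation, then attach a sample-specific guideline."""
    drafts = [generate_session(t, complexity) for t in topics]
    sessions = transform_narrative(drafts)
    guideline = f"Check recall of facts introduced across {len(sessions)} sessions."
    return Sample(sessions=sessions, guideline=guideline)

sample = synthesize(["travel plans", "diet change"], complexity=1)
print(len(sample.sessions), sample.guideline)
```

The `complexity` knob stands in for the abstract's "controllable complexity": raising it lengthens sessions here, whereas the real framework presumably controls richer properties such as topic drift and fact evolution.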
Problem

Research questions and friction points this paper is trying to address.

multi-session dialogue memory
large language models
cognitive psychology
memory benchmark
declarative and non-declarative memory
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-session dialogue memory
cognitive-driven benchmark
hybrid data synthesis
declarative and non-declarative memory
fine-grained evaluation
👥 Authors
Ye Shen
Baylor College of Medicine
Dun Pei
Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory
Yiqiu Guo
Shanghai Artificial Intelligence Laboratory, Fudan University
Junying Wang
PhD Student at Shanghai AI Lab & Fudan University
LMM benchmark, AIGC, AI Safety
Yijin Guo
Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory
Zicheng Zhang
Shanghai AI Lab
Multi-modal LLM, Quality assessment
Qi Jia
Shanghai Artificial Intelligence Laboratory
Jun Zhou
Shanghai Jiao Tong University
Guangtao Zhai
Professor, IEEE Fellow, Shanghai Jiao Tong University
Multimedia Signal Processing, Visual Quality Assessment, QoE, AI Evaluation, Displays