EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the lack of systematic evaluation of diverse memory capabilities in large language models (LLMs) within multi-turn conversational settings. Drawing on cognitive psychology, the authors introduce the first benchmark specifically designed to assess both declarative and non-declarative memory in multi-session dialogues, decomposing memory performance into fine-grained, measurable dimensions. To support the benchmark, they propose a hybrid data synthesis framework that combines topic-guided generation with narrative-inspired heuristics, enabling controlled generation of dialogue data of varying complexity, along with sample-level evaluation guidelines. Experiments reveal significant performance disparities across memory dimensions in current models and show that prevailing memory mechanisms do not consistently improve performance; gains are often offset by computational inefficiencies and scalability bottlenecks.
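To make the "fine-grained, measurable dimensions" concrete: declarative memory is classically split into episodic and semantic memory, while non-declarative memory covers abilities such as procedural memory. A minimal sketch of such a taxonomy is below; the declarative/non-declarative split comes from the paper, but the specific sub-dimensions and all identifiers follow the standard cognitive-psychology textbook decomposition and are assumptions, not EvolMem's exact category names.

```python
# Hypothetical sketch of a cognitive-psychology memory taxonomy for a
# benchmark like EvolMem. The top-level split is stated in the paper;
# the sub-dimensions below are standard textbook categories and are
# assumptions, not the paper's exact labels.
MEMORY_TAXONOMY = {
    "declarative": {
        "episodic": "recall of specific events from earlier sessions",
        "semantic": "facts and knowledge accumulated across sessions",
    },
    "non_declarative": {
        "procedural": "skills and habits exhibited without explicit recall",
    },
}

def dimensions(taxonomy: dict) -> list[str]:
    """Flatten the taxonomy into 'branch/ability' dimension labels."""
    return [
        f"{branch}/{ability}"
        for branch, abilities in taxonomy.items()
        for ability in abilities
    ]

print(dimensions(MEMORY_TAXONOMY))
# ['declarative/episodic', 'declarative/semantic', 'non_declarative/procedural']
```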


๐Ÿ“ Abstract
Despite recent advances in understanding and leveraging long-range conversational memory, existing benchmarks still lack systematic evaluation of large language models (LLMs) across diverse memory dimensions, particularly in multi-session settings. In this work, we propose EvolMem, a new benchmark for assessing the multi-session memory capabilities of LLMs and agent systems. EvolMem is grounded in cognitive psychology and encompasses both declarative and non-declarative memory, further decomposed into multiple fine-grained abilities. To construct the benchmark, we introduce a hybrid data synthesis framework that consists of topic-initiated generation and narrative-inspired transformations. This framework enables scalable generation of multi-session conversations with controllable complexity, accompanied by sample-specific evaluation guidelines. Extensive evaluation reveals that no LLM consistently outperforms others across all memory dimensions. Moreover, agent memory mechanisms do not necessarily enhance LLMs' capabilities and often exhibit notable efficiency limitations. Data and code will be released at https://github.com/shenye7436/EvolMem.
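The abstract names two synthesis stages: topic-initiated generation and narrative-inspired transformations, with a sample-specific evaluation guideline attached to each sample. A minimal sketch of how such a pipeline could be wired is below; `generate_session`, `transform_narrative`, the `complexity` knob, and the placeholder bodies are hypothetical, since the paper's actual interfaces are not shown on this page (in practice the generation step would call an LLM).

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    """One multi-session benchmark sample plus its own evaluation rubric."""
    sessions: list = field(default_factory=list)  # list of sessions, each a list of turns
    guideline: str = ""  # sample-specific evaluation guideline

def generate_session(topic: str, complexity: int) -> list:
    """Topic-initiated generation: draft one session's turns for a topic.
    Placeholder stand-in for an LLM call."""
    return [f"[{topic}] turn {i} (complexity={complexity})" for i in range(2 + complexity)]

def transform_narrative(sessions: list) -> list:
    """Narrative-inspired transformation: restructure sessions so facts
    evolve across the conversation. A trivial reversal as a stand-in."""
    return list(reversed(sessions))

def synthesize(topics: list, complexity: int) -> Sample:
    """Hybrid pipeline: topic-initiated drafts, then a narrative
    transformation, then attach a sample-specific guideline."""
    drafts = [generate_session(t, complexity) for t in topics]
    sessions = transform_narrative(drafts)
    guideline = f"Check recall of facts introduced across {len(sessions)} sessions."
    return Sample(sessions=sessions, guideline=guideline)

sample = synthesize(["travel plans", "diet change"], complexity=1)
print(len(sample.sessions), sample.guideline)
```

The `complexity` knob stands in for the abstract's "controllable complexity": raising it lengthens sessions here, whereas the real framework presumably controls richer properties such as topic drift and fact evolution.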
Problem

Research questions and friction points this paper is trying to address.

multi-session dialogue memory
large language models
cognitive psychology
memory benchmark
declarative and non-declarative memory
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-session dialogue memory
cognitive-driven benchmark
hybrid data synthesis
declarative and non-declarative memory
fine-grained evaluation
👥 Authors
Ye Shen
Baylor College of Medicine
Dun Pei
Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory
Yiqiu Guo
Shanghai Artificial Intelligence Laboratory, Fudan University
Junying Wang
PhD Student at Shanghai AI Lab & Fudan University
LMM benchmark, AIGC, AI Safety
Yijin Guo
Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory
Zicheng Zhang
Shanghai AI Lab
Multi-modal LLM, Quality assessment
Qi Jia
Shanghai Artificial Intelligence Laboratory
Jun Zhou
Shanghai Jiao Tong University
Guangtao Zhai
Professor, IEEE Fellow, Shanghai Jiao Tong University
Multimedia Signal Processing, Visual Quality Assessment, QoE, AI Evaluation, Displays