Episodic Memories Generation and Evaluation Benchmark for Large Language Models

📅 2025-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) exhibit significant deficiencies in episodic memory, particularly in multi-event association and complex spatio-temporal reasoning, which hinders human-like cognitive capabilities. To address this, we introduce an episodic memory generation and evaluation benchmark tailored for LLMs. Our method incorporates a cognitive-science-inspired structured event representation to construct a contamination-free, multi-granular episodic memory dataset, and defines reproducible evaluation protocols for recall, cross-event association, and temporal reasoning. Experimental results show that state-of-the-art models, including GPT-4, Claude, Llama 3.1, and o1-mini, frequently hallucinate and fail to correctly resolve cross-event spatio-temporal constraints, even within contexts of 10k-100k tokens. We fully open-source the dataset, evaluation code, and benchmarking tools, filling a critical gap in the evaluation of LLM episodic memory.

📝 Abstract
Episodic memory -- the ability to recall specific events grounded in time and space -- is a cornerstone of human cognition, enabling not only coherent storytelling, but also planning and decision-making. Despite their remarkable capabilities, Large Language Models (LLMs) lack a robust mechanism for episodic memory: we argue that integrating episodic memory capabilities into LLMs is essential for advancing AI towards human-like cognition, increasing their potential to reason consistently and ground their output in real-world episodic events, hence avoiding confabulations. To address this challenge, we introduce a comprehensive framework to model and evaluate LLM episodic memory capabilities. Drawing inspiration from cognitive science, we develop a structured approach to represent episodic events, encapsulating temporal and spatial contexts, involved entities, and detailed descriptions. We synthesize a unique episodic memory benchmark, free from contamination, and release open source code and datasets to assess LLM performance across various recall and episodic reasoning tasks. Our evaluation of state-of-the-art models, including GPT-4 and Claude variants, Llama 3.1, and o1-mini, reveals that even the most advanced LLMs struggle with episodic memory tasks, particularly when dealing with multiple related events or complex spatio-temporal relationships -- even in contexts as short as 10k-100k tokens.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Episodic Memory
Spatio-Temporal Relationships
Innovation

Methods, ideas, or system contributions that make the work stand out.

Situational Memory Enhancement
Complex Event Reasoning
Unified Benchmark for AI Models