🤖 AI Summary
Current large language models (LLMs) exhibit significant deficiencies in episodic memory—particularly in multi-event association and complex spatiotemporal reasoning—hindering progress toward human-like cognition. To address this, we introduce an explicit episodic memory generation and evaluation benchmark tailored to LLMs. Our method incorporates a cognitive science–inspired structured event representation to construct a contamination-free, multi-granular episodic memory dataset, and it defines reproducible evaluation protocols for recall, cross-event association, and temporal reasoning. Experimental results show that state-of-the-art models—including GPT-4, Claude, Llama 3.1, and o1-mini—frequently hallucinate and fail to correctly resolve cross-event spatiotemporal constraints, even within 10k–100k-token contexts. We fully open-source the dataset, evaluation code, and benchmarking tools, filling a critical gap in LLM episodic memory modeling.
📝 Abstract
Episodic memory -- the ability to recall specific events grounded in time and space -- is a cornerstone of human cognition, enabling not only coherent storytelling but also planning and decision-making. Despite their remarkable capabilities, Large Language Models (LLMs) lack a robust mechanism for episodic memory: we argue that integrating episodic memory capabilities into LLMs is essential for advancing AI towards human-like cognition, increasing their potential to reason consistently and to ground their output in real-world episodic events, thereby avoiding confabulations. To address this challenge, we introduce a comprehensive framework to model and evaluate LLM episodic memory capabilities. Drawing inspiration from cognitive science, we develop a structured approach to representing episodic events, encapsulating temporal and spatial contexts, involved entities, and detailed descriptions. We synthesize a unique episodic memory benchmark, free from contamination, and release open-source code and datasets to assess LLM performance across various recall and episodic reasoning tasks. Our evaluation of state-of-the-art models, including GPT-4 and Claude variants, Llama 3.1, and o1-mini, reveals that even the most advanced LLMs struggle with episodic memory tasks, particularly when dealing with multiple related events or complex spatio-temporal relationships -- even in contexts as short as 10k-100k tokens.
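To make the structured event representation concrete, here is a minimal illustrative sketch in Python. The `EpisodicEvent` class and the `recall_by_cue` helper are hypothetical names chosen for this example, not the paper's actual schema or API; the sketch only assumes, as the abstract states, that each event bundles a temporal context, a spatial context, involved entities, and a description, and that recall tasks query events by such cues.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EpisodicEvent:
    """One episodic event: when and where it happened, who was involved, and what occurred.

    This is an illustrative structure, not the benchmark's actual schema.
    """
    time: str          # temporal context, e.g. an ISO date
    location: str      # spatial context
    entities: tuple    # entities involved in the event
    description: str   # free-text account of what happened

def recall_by_cue(events, location=None, entity=None):
    """Cued recall (hypothetical task sketch): return events matching
    an optional spatial cue and/or an optional entity cue."""
    hits = []
    for ev in events:
        if location is not None and ev.location != location:
            continue
        if entity is not None and entity not in ev.entities:
            continue
        hits.append(ev)
    return hits

# Two toy events sharing an entity, to mimic a cross-event association query.
events = [
    EpisodicEvent("2021-03-14", "Lyon", ("Alice",), "Alice boarded the train."),
    EpisodicEvent("2021-03-15", "Paris", ("Alice", "Bob"), "Alice met Bob for lunch."),
]
print(len(recall_by_cue(events, entity="Alice")))   # prints 2
print(recall_by_cue(events, location="Paris")[0].time)  # prints 2021-03-15
```

A cross-event question such as "where was Alice the day after she boarded the train?" requires chaining the entity cue with the temporal ordering of matching events, which is exactly the kind of multi-event spatio-temporal constraint the evaluation probes.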