🤖 AI Summary
Large language models (LLMs) lack human-like working memory: the lack of an ability to actively maintain, manipulate, and retrieve intermediate reasoning states within latent representations leads to inconsistent and irrational responses. This work is the first to systematically identify and characterize this deficiency, proposing a novel evaluation paradigm centered on internal representational capacity. We design three context-decoupled tasks—Number Guessing, Yes-No Deduction, and Math Magic—to rigorously assess latent-space information retention and transformation under minimal external prompting. Evaluating 17 state-of-the-art models across four major architectures, we consistently observe that models fail to stably preserve reasoning states without auxiliary cues. Our study establishes the first reproducible, cross-architectural working memory benchmark for LLMs, with publicly released code and prompt templates. This benchmark enables fine-grained cognitive modeling of LLM reasoning and opens new avenues for architectural and training-level enhancements targeting working memory functionality.
📝 Abstract
While Large Language Models (LLMs) exhibit remarkable reasoning abilities, we demonstrate that they lack a fundamental aspect of human cognition: working memory. Human working memory is an active cognitive system that enables not only the temporary storage of information but also its processing and utilization, supporting coherent reasoning and decision-making. Without working memory, individuals may produce unrealistic responses, exhibit self-contradictions, and struggle with tasks that require mental reasoning. Existing evaluations using N-back or context-dependent tasks fall short because they allow LLMs to exploit external context rather than retain the reasoning process in the latent space. We introduce three novel tasks: (1) Number Guessing, (2) Yes-No Deduction, and (3) Math Magic, designed to isolate internal representation from external context. Across seventeen frontier models spanning four major model families, we consistently observe irrational or contradictory behaviors, indicating LLMs' inability to retain and manipulate latent information. Our work establishes a new benchmark for evaluating working memory in LLMs and highlights this limitation as a key bottleneck for advancing reliable reasoning systems. Code and prompts for the experiments are available at https://github.com/penguinnnnn/LLM-Working-Memory.
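The core idea behind a task like Number Guessing can be illustrated with a minimal consistency check: the model is asked to silently commit to a value, answer property questions about it, and only then reveal it, so any contradiction between the answers and the revealed value exposes a failure to maintain a latent state. The sketch below is a hypothetical illustration of this evaluation logic, not the paper's exact protocol; the question wording and scoring are assumptions.

```python
# Minimal sketch (assumed protocol, not the paper's exact prompts):
# the model silently picks a number in [1, 10], answers yes/no property
# questions, then reveals the number. We flag contradictions between
# the earlier answers and the revealed value.

def is_consistent(revealed: int, answers: dict) -> bool:
    """Return True if the revealed number agrees with all recorded answers."""
    # Ground-truth properties of the revealed number.
    checks = {
        "is it even?": revealed % 2 == 0,
        "is it greater than 5?": revealed > 5,
    }
    return all(
        answers[q] == truth for q, truth in checks.items() if q in answers
    )

# Example transcript: the model answered "yes" to even, "no" to > 5.
transcript = {"is it even?": True, "is it greater than 5?": False}
print(is_consistent(7, transcript))  # False: 7 is odd, contradicting "even"
print(is_consistent(4, transcript))  # True: 4 is even and not > 5
```

A model with a genuine latent commitment would always pass this check; the paper's finding is that, without the state being written into visible context, frontier models frequently do not.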