🤖 AI Summary
This work addresses a critical gap in evaluating large language models (LLMs) by extending memory assessment beyond explicit factual recall to encompass implicit memory capabilities. Drawing on non-declarative memory theories from cognitive science, the study introduces the first systematic evaluation framework for implicit memory in LLMs, grounded in three mechanisms: procedural memory, priming effects, and classical conditioning. The authors implement a unified “learning/priming–interference–testing” protocol and develop a standardized benchmark comprising 300 tasks, incorporating paired experiments, first-trial scoring, and interference controls to quantify models’ ability to automatically leverage prior experience without explicit prompting. Evaluation of 17 prominent LLMs reveals overall weak performance (peak accuracy: 65.3%), substantially below human levels, with pronounced asymmetries in inhibition and preference, underscoring the urgent need for architectural innovations beyond mere parameter scaling.
📝 Abstract
Existing memory benchmarks for LLM agents evaluate explicit recall of facts, yet overlook implicit memory where experience becomes automated behavior without conscious retrieval. This gap is critical: effective assistants must automatically apply learned procedures or avoid failed actions without explicit reminders. We introduce ImplicitMemBench, the first systematic benchmark evaluating implicit memory through three cognitively grounded constructs drawn from standard cognitive-science accounts of non-declarative memory: Procedural Memory (one-shot skill acquisition after interference), Priming (theme-driven bias via paired experimental/control instances), and Classical Conditioning (Conditioned Stimulus–Unconditioned Stimulus (CS–US) associations shaping first decisions). Our 300-item suite employs a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring. Evaluation of 17 models reveals severe limitations: no model exceeds 66% overall, with top performers DeepSeek-R1 (65.3%), Qwen3-32B (64.1%), and GPT-5 (63.0%) far below human baselines. Analysis uncovers dramatic asymmetries (inhibition 17.6% vs. preference 75.0%) and universal bottlenecks requiring architectural innovations beyond parameter scaling. ImplicitMemBench reframes evaluation from "what agents recall" to "what they automatically enact".
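The Learning/Priming-Interfere-Test protocol with first-attempt scoring and paired experimental/control instances can be sketched roughly as follows. This is a hypothetical illustration of the design described in the abstract, not the benchmark's actual code; all names (`Task`, `run_trial`, `priming_effect`, the `model` callable) are assumptions.

```python
# Hypothetical sketch of the Learning/Priming-Interfere-Test protocol.
# All identifiers are illustrative, not ImplicitMemBench's real API.
from dataclasses import dataclass

@dataclass
class Task:
    learning_turns: list      # exposure phase: skill demo, prime, or CS-US pairing
    interference_turns: list  # unrelated filler, so success can't rely on recency
    probe: str                # test query, with no explicit reminder of the exposure
    target: str               # behavior expected if prior experience transferred

def run_trial(model, task):
    """Run one task end to end; score only the first attempt at the probe."""
    history = []
    for turn in task.learning_turns + task.interference_turns:
        history.append((turn, model(history, turn)))
    first_response = model(history, task.probe)  # first-attempt scoring: no retries
    return first_response == task.target

def priming_effect(model, experimental, control):
    """Paired design: effect = hit rate on primed items minus hit rate on
    matched control items that lack the exposure phase (or its theme)."""
    exp = sum(run_trial(model, t) for t in experimental) / len(experimental)
    ctl = sum(run_trial(model, t) for t in control) / len(control)
    return exp - ctl
```

The paired experimental/control structure isolates implicit influence: a model that merely answers the probe well scores equally on both arms, so only exposure-driven behavior moves the difference.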