🤖 AI Summary
This work addresses the inability of existing benchmarks for large language model (LLM) memory systems to diagnose failure modes at fine granularity: they report aggregate accuracy and treat the memory system as a black box, so an error cannot be traced to its root cause. We formalize LLM memory systems as the composition of three canonical operations (summarization, storage, and retrieval) and construct adversarial datasets that each target one operation, so that failures can be isolated and attributed. Within this framework, we introduce five diagnostic datasets spanning four tasks and systematically evaluate four state-of-the-art memory systems. The evaluation shows how the benchmark exposes the trade-offs induced by differences in memory system architecture. The resulting benchmark, MemFail, supports interpretable evaluation and targeted improvement of LLM memory systems.
📝 Abstract
Large language model (LLM) agents increasingly rely on external memory systems to remain consistent across long-horizon interactions, but little empirical work has been done to understand the specific failure modes and design choices that these systems present. Existing benchmarks report aggregate question-answering accuracy and treat memory systems as black boxes, making it impossible to attribute an incorrect answer to a particular failure mode of the system. We introduce MemFail, a diagnostic benchmark that isolates the failure modes of modern LLM memory systems. We begin by formalizing memory systems as the composition of three canonical operations -- summarization, storage, and retrieval -- and identify the potential failure modes induced by each. Based on these hypothesized failure modes, we construct five datasets spanning four tasks, each adversarially designed to test a specific operation of a memory system. Using these datasets, we evaluate four state-of-the-art memory systems on MemFail and demonstrate how MemFail can be used to empirically understand the tradeoffs induced by differences in memory system architectures.
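To make the three-operation decomposition concrete, here is a minimal toy sketch in Python. It is not MemFail's code or method: the `MemorySystem` interface, the three stand-in operations, and the `first_failing_operation` helper are all hypothetical names introduced for illustration, and the substring check is a deliberately crude proxy for "the fact survived this stage". The abstract describes adversarial datasets that target each operation, whereas this sketch simply inspects each stage directly.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MemorySystem:
    """Toy memory system composed of summarize -> store -> retrieve.

    Hypothetical interface for illustration; not taken from the paper.
    """
    summarize: Callable[[str], str]
    store: Callable[[List[str], str], None]
    retrieve: Callable[[List[str], str], List[str]]
    memory: List[str] = field(default_factory=list)


def truncate_summary(turn: str, max_words: int = 6) -> str:
    # Lossy stand-in summarizer: keeps only the first few words of a turn.
    return " ".join(turn.split()[:max_words])


def append_store(memory: List[str], item: str) -> None:
    # Stand-in storage: append every summary unchanged.
    memory.append(item)


def keyword_retrieve(memory: List[str], query: str) -> List[str]:
    # Stand-in retriever: entries sharing at least one word with the query.
    terms = set(query.lower().split())
    return [m for m in memory if terms & set(m.lower().split())]


def first_failing_operation(
    system: MemorySystem, turn: str, fact: str, query: str
) -> str:
    """Name the first operation at which `fact` is lost, or "none"."""
    summary = system.summarize(turn)
    if fact not in summary:
        return "summarization"
    system.store(system.memory, summary)
    if not any(fact in m for m in system.memory):
        return "storage"
    if not any(fact in m for m in system.retrieve(system.memory, query)):
        return "retrieval"
    return "none"


system = MemorySystem(truncate_summary, append_store, keyword_retrieve)
print(first_failing_operation(
    system,
    "My flight to Oslo leaves at 9am on Friday from gate B12",
    "B12",
    "which gate",
))
```

In this toy run the summarizer truncates the turn before the gate number, so the loss is attributed to summarization rather than to storage or retrieval. An aggregate question-answering score would record the same wrong answer without saying which operation caused it.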