Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

📅 2026-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current agent memory systems lack reliable evaluation benchmarks and systematic architectural analysis, leading to distorted performance assessments and ad hoc design choices. This work closes that gap by proposing, for the first time, a taxonomy grounded in four distinct memory architectures, examined empirically across multiple large language model backbones. The study systematically evaluates these architectures in terms of semantic utility, benchmark saturation, model dependency, and memory overhead, uncovering fundamental limitations that explain why real-world performance consistently falls short of theoretical expectations. The findings provide empirical evidence and actionable guidance for designing scalable, evaluable memory systems for artificial agents.

📝 Abstract
Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across backbone models, and system-level costs are frequently overlooked. This survey presents a structured analysis of agentic memory from both architectural and system perspectives. We first introduce a concise taxonomy of MAG systems based on four memory structures. Then, we analyze key pain points limiting current systems, including benchmark saturation effects, metric validity and judge sensitivity, backbone-dependent accuracy, and the latency and throughput overhead introduced by memory maintenance. By connecting the memory structure to empirical limitations, this survey clarifies why current agentic memory systems often underperform their theoretical promise and outlines directions for more reliable evaluation and scalable system design.
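The abstract attributes part of the systems-level cost to the latency and throughput overhead of memory maintenance. As a rough illustration only (the class and variable names below are hypothetical, not from the paper), here is a minimal sketch of a flat memory store whose per-turn write and retrieval costs grow with store size:

```python
import time
from collections import deque

# Illustrative flat memory store: a bounded log of (timestamp, text)
# entries with naive substring retrieval. Every lookup is a linear scan,
# so per-turn latency grows with the number of stored interactions --
# the kind of maintenance overhead the abstract points to.
class FlatMemory:
    def __init__(self, capacity=1000):
        self.entries = deque(maxlen=capacity)  # oldest entries evicted first

    def write(self, text):
        self.entries.append((time.monotonic(), text))

    def retrieve(self, query, k=3):
        # Linear scan: O(n) per lookup.
        hits = [t for _, t in self.entries if query.lower() in t.lower()]
        return hits[-k:]  # the k most recent matches

mem = FlatMemory()
for i in range(500):
    mem.write(f"turn {i}: user mentioned topic-{i % 7}")

start = time.monotonic()
recent = mem.retrieve("topic-3")
lookup_latency = time.monotonic() - start  # grows with stored history
```

More sophisticated structures (hierarchical or graph-based memories) trade this scan cost for indexing and consolidation work at write time, which is one way the benchmark-versus-system tension the survey analyzes can arise.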
Problem

Research questions and friction points this paper is trying to address.

agentic memory
evaluation metrics
benchmark limitations
system overhead
LLM agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic memory
taxonomy
empirical analysis
evaluation metrics
system overhead
Authors
Dongming Jiang
University of Texas at Dallas
Yi Li
University of Texas at Dallas
Songtao Wei
Ph.D. student at University of Texas at Dallas
Machine Learning, Deep Learning, Large Language Models
Jinxin Yang
University of Texas at Dallas
Ayushi Kishore
University of California, Davis
Alysa Zhao
Texas A&M University
Dingyi Kang
University of Texas at Dallas
Xu Hu
University of Texas at Dallas
Feng Chen
Department of Computer Science, UT Dallas
Data Mining, Machine Learning, Artificial Intelligence
Qiannan Li
University of California, Davis
Bingzhe Li
Assistant Professor of Computer Science, University of Texas at Dallas
Intelligent Storage Systems, Systems for AI/ML, DNA Storage