EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of systematic evaluation of memory mechanisms in existing benchmarks for large language model (LLM) agents. The authors introduce EvoMemBench, a unified benchmark organized along two axes: memory scope (in-episode vs. cross-episode) and memory content (knowledge-oriented vs. execution-oriented). Under a standardized protocol, they compare 15 representative memory methods against strong long-context baselines. The results show that no single form of memory works consistently across all settings, and that the benefit of memory is highly task-dependent: it helps most when the current context is insufficient or the task is difficult. Long-context baselines remain highly competitive. Retrieval-based methods are strong in knowledge-intensive settings, while procedural and long-term memory methods are more effective on execution-oriented tasks when their stored experience matches the task structure.

📝 Abstract

Recent benchmarks for Large Language Model (LLM) agents mainly evaluate reasoning, planning, and execution. However, memory is also essential for agents, as it enables them to store, update, and retrieve information over time. This ability remains under-evaluated, largely because existing benchmarks do not provide a systematic way to assess memory mechanisms. In this paper, we study agent memory from a self-evolving perspective and introduce EvoMemBench, a unified benchmark organized along two axes: memory scope (in-episode vs. cross-episode) and memory content (knowledge-oriented vs. execution-oriented). We compare 15 representative memory methods with strong long-context baselines under a standardized protocol. Results show that current memory systems are still far from a general solution: long-context baselines remain highly competitive, memory helps most when the current context is insufficient or tasks are difficult, and no single memory form works consistently across all settings. Retrieval-based methods remain strong for knowledge-intensive settings, whereas procedural and long-term memory methods are more effective for execution-oriented tasks when their stored experience matches the task structure. We hope EvoMemBench facilitates future research on more effective memory systems for LLM-based agents. Our code is available at https://github.com/DSAIL-Memory/EvoMemBench.
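
The two-axis organization described in the abstract can be pictured as a 2x2 grid of evaluation settings. The following is a minimal Python sketch of that grid, not the benchmark's actual code: the names `MemoryScope`, `MemoryContent`, `benchmark_cells`, and `best_method_per_cell` are hypothetical and chosen only for illustration, and the scoring dictionary format is an assumption.

```python
from enum import Enum
from itertools import product


class MemoryScope(Enum):
    """First axis from the abstract: where the memory is used."""
    IN_EPISODE = "in-episode"
    CROSS_EPISODE = "cross-episode"


class MemoryContent(Enum):
    """Second axis from the abstract: what the memory stores."""
    KNOWLEDGE = "knowledge-oriented"
    EXECUTION = "execution-oriented"


def benchmark_cells():
    """Return every (scope, content) pair of the two-axis grid."""
    return list(product(MemoryScope, MemoryContent))


def best_method_per_cell(scores):
    """Pick the top-scoring method for each grid cell.

    `scores` is a hypothetical mapping of
    (scope, content, method_name) -> numeric score.
    Returns a dict of (scope, content) -> (method_name, score).
    """
    best = {}
    for (scope, content, method), score in scores.items():
        cell = (scope, content)
        if cell not in best or score > best[cell][1]:
            best[cell] = (method, score)
    return best
```

Grouping results by cell in this way mirrors the abstract's observation that different memory forms lead in different settings, so a single aggregate score would hide which method suits which cell.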

Problem

Research questions and friction points this paper is trying to address.

agent memory

benchmarking

large language models

memory evaluation

self-evolving

Innovation

Methods, ideas, or system contributions that make the work stand out.

agent memory

self-evolving

memory benchmark