Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluations of large language model (LLM) memory mostly rely on single aggregate metrics, which can obscure critical failure modes such as forgetting and negative transfer. This work proposes SeqMem-Eval, a diagnostic framework that adapts multidimensional evaluation principles from continual learning to LLM memory assessment. It targets a test-time setting in which memory is external, prompt-mediated, and updated without modifying model parameters, and it tracks memory dynamics along four dimensions: online utility, hold-out generalization, backward transfer, and forgetting. Experiments show that high final or cumulative accuracy does not necessarily indicate good memory quality, and that different memory designs make distinct trade-offs between stability and adaptability.

📝 Abstract

Memory plays a central role in enabling large language models (LLMs) to operate over sequential tasks by accumulating and reusing experience over time. However, existing evaluations of LLM memory mostly rely on aggregate metrics such as final hold-out accuracy or cumulative online performance, which can obscure critical failure modes such as forgetting and negative transfer. In this paper, we introduce SeqMem-Eval, a diagnostic evaluation framework for sequentially evolving LLM memory. Drawing inspiration from continual learning, it targets a test-time setting in which memory is external, prompt-mediated, and updated without modifying model parameters. Rather than focusing only on final performance, SeqMem-Eval evaluates how memory states evolve, generalize, consolidate experience, and retain useful information during sequential inference. Specifically, it measures online utility, hold-out generalization, backward transfer, and forgetting, providing a finer-grained view of memory quality. Through extensive experiments across diverse tasks and memory methods, we show that higher final or cumulative accuracy does not necessarily imply better memory quality: many methods exhibit strong performance gains while suffering from substantial forgetting or negative transfer. Moreover, different memory designs exhibit distinct trade-offs between adaptability and stability that remain invisible under standard evaluation metrics.
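The abstract names backward transfer and forgetting but does not give their formulas. As a rough illustration only, the sketch below uses the definitions common in the continual learning literature, which may differ from the ones SeqMem-Eval actually adopts. It assumes a hypothetical accuracy matrix `acc`, where `acc[i][j]` is the accuracy on task `j` measured with the memory state obtained after processing task `i`. The two matrices are invented numbers chosen to make the point, not results from the paper.

```python
from statistics import mean

# Assumption: acc[i][j] = accuracy on task j, evaluated with the memory
# state reached after processing task i (tasks indexed 0..T-1).
# These are standard continual-learning formulas, not necessarily the
# exact definitions used by SeqMem-Eval.

def final_accuracy(acc):
    """Average accuracy over all tasks using the final memory state."""
    return mean(acc[-1])

def backward_transfer(acc):
    """Mean change on earlier tasks between first learning them and the end.
    Negative values mean later updates hurt earlier tasks."""
    last = len(acc) - 1
    return mean([acc[last][j] - acc[j][j] for j in range(last)])

def forgetting(acc):
    """Mean drop from the best accuracy ever reached on a task (after it
    was seen, before the final state) to its final accuracy."""
    last = len(acc) - 1
    drops = []
    for j in range(last):
        best_so_far = max(acc[i][j] for i in range(j, last))
        drops.append(best_so_far - acc[last][j])
    return mean(drops)

# Two hypothetical memory methods with identical final accuracy.
stable_method = [
    [0.6, 0.5, 0.5],
    [0.6, 0.7, 0.5],
    [0.6, 0.7, 0.8],
]
forgetful_method = [
    [0.9, 0.5, 0.5],
    [0.8, 0.9, 0.5],
    [0.6, 0.7, 0.8],
]

for name, acc in [("stable", stable_method), ("forgetful", forgetful_method)]:
    print(
        name,
        f"final={final_accuracy(acc):.2f}",
        f"bwt={backward_transfer(acc):+.2f}",
        f"forgetting={forgetting(acc):.2f}",
    )
```

Both toy methods end at the same final accuracy, yet the second one loses a large share of what it had learned on earlier tasks. That is the kind of difference a single aggregate score hides and a per-dimension breakdown exposes.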

Problem

Research questions and friction points this paper is trying to address.

LLM memory

sequential tasks

evaluation metrics

forgetting

negative transfer

Innovation

Methods, ideas, or system contributions that make the work stand out.

SeqMem-Eval

LLM memory evaluation

continual learning