MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM evaluation benchmarks predominantly focus on homogeneous reading comprehension and long-context understanding, overlooking the critical capability of continual learning from user feedback during the service phase. To address this gap, the authors introduce MemoryBench, a benchmark designed to evaluate continual learning from simulated real-world user feedback in deployed LLM services. MemoryBench spans multiple domains, languages, and task types, and incorporates a user feedback simulation mechanism, dynamic knowledge updating, and memory retention evaluation. Methodologically, it features a multi-dimensional task architecture and a cross-lingual evaluation framework to emulate interactive, online learning scenarios. Empirical evaluation reveals that state-of-the-art LLM systems exhibit significant limitations in both continual learning effectiveness and efficiency. These findings underscore MemoryBench's value in advancing research on LLM memory modeling and online optimization algorithms.

📝 Abstract
Scaling up data, parameters, and test-time computation has been the mainstream method for improving LLM systems (LLMsys), but its upper bound is almost reached due to the gradual depletion of high-quality data and the marginal gains obtained from ever-larger computational resource consumption. Inspired by the ability of humans and traditional AI systems to learn from practice, constructing memory and continual learning frameworks for LLMsys has become an important and popular research direction in recent literature. Yet, existing benchmarks for LLM memory often evaluate systems on homogeneous reading comprehension tasks with long-form inputs rather than testing their ability to learn from accumulated user feedback at service time. We therefore propose a user feedback simulation framework and a comprehensive benchmark covering multiple domains, languages, and types of tasks to evaluate the continual learning abilities of LLMsys. Experiments show that the effectiveness and efficiency of state-of-the-art baselines are far from satisfactory, and we hope this benchmark paves the way for future studies on LLM memory and optimization algorithms.
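The abstract outlines a service-time loop: queries arrive one by one, a simulated user scores each answer, and the system may fold that feedback into its memory before the next query. The paper's actual API is not given here, so the sketch below is a minimal, hypothetical version of such a loop; MemorySystem, simulate_feedback, and run_stream are illustrative names, not the benchmark's real interface.

```python
# Hypothetical sketch of the service-time feedback loop described in the
# abstract. All names are illustrative, not the benchmark's actual API.
from dataclasses import dataclass, field


@dataclass
class MemorySystem:
    """Toy LLM system with an append-only feedback memory."""
    memory: list = field(default_factory=list)

    def answer(self, query: str) -> str:
        # A real system would condition an LLM on `query` plus entries
        # retrieved from `self.memory`; here we return a placeholder.
        return f"answer({query!r}, memory_size={len(self.memory)})"

    def update(self, query: str, answer: str, feedback: float) -> None:
        # Store the interaction so later queries can benefit from it.
        self.memory.append((query, answer, feedback))


def simulate_feedback(answer: str, gold: str) -> float:
    # Stand-in for the user simulator: 1.0 if the gold span appears
    # in the answer, else 0.0.
    return 1.0 if gold in answer else 0.0


def run_stream(system: MemorySystem, stream: list) -> list:
    """Replay (query, gold) pairs in order, updating memory after each."""
    scores = []
    for query, gold in stream:
        answer = system.answer(query)
        score = simulate_feedback(answer, gold)
        system.update(query, answer, score)
        scores.append(score)
    return scores
```

Comparing accuracy on the early versus the late portion of the stream is one simple way such a loop can expose whether a system actually learns from accumulated feedback.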
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks evaluate LLM memory only on homogeneous reading comprehension tasks
Testing continual learning from accumulated user feedback at service time
Assessing LLM systems across domains, languages, and task types
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simulates user feedback for LLM memory evaluation
Covers multiple domains, languages, and task types
Benchmarks continual learning in LLM systems (one possible retention metric is sketched below)
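As one concrete way to quantify the knowledge retention such a benchmark stresses, a standard continual-learning measure is average forgetting: how much accuracy on earlier tasks drops after later ones are learned. The paper's exact metrics are not given here, so this is a hedged sketch under that assumption.

```python
# Hedged sketch of one standard continual-learning metric, average
# forgetting, as a plausible stand-in for the benchmark's memory
# retention evaluation; the paper's exact metrics may differ.
def average_forgetting(acc):
    """acc[i][j] = accuracy on task j after learning tasks 0..i.

    Forgetting on task j is the best accuracy it ever reached minus its
    accuracy after the final task, averaged over all but the last task.
    """
    final = acc[-1]
    n_tasks = len(final)
    drops = []
    for j in range(n_tasks - 1):
        best = max(acc[i][j] for i in range(j, len(acc)))
        drops.append(best - final[j])
    return sum(drops) / len(drops)


# Example: accuracy on task 0 drops from 0.9 to 0.6 after task 1 is
# learned, so average forgetting is ~0.3.
acc_matrix = [
    [0.9, 0.0],   # after task 0
    [0.6, 0.8],   # after task 1
]
print(average_forgetting(acc_matrix))  # ~0.3
```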
👥 Authors
Qingyao Ai
Associate Professor, Dept. of CS&T, Tsinghua University
Information Retrieval · Machine Learning

Yichen Tang
Department of Computer Science and Technology, Tsinghua University

Changyue Wang
Tsinghua University
Information Retrieval · Large Language Models · AI for Legal

Jianming Long
Department of Computer Science and Technology, Tsinghua University

Weihang Su
Tsinghua University
Information Retrieval · Natural Language Processing · AI for Legal

Yiqun Liu
Department of Computer Science and Technology, Tsinghua University