MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models

📅 2026-01-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the absence of systematic benchmarks for evaluating reward models in long-term memory management within large language models. To this end, the authors introduce the first benchmark specifically designed for assessing reward models in this context, featuring long-context comprehension and long-form generation tasks spanning 8K–128K tokens across ten distinct memory-management settings. The study conducts a comprehensive evaluation of thirteen state-of-the-art reward models, revealing that the performance gap between open-source and closed-source models has substantially narrowed, with newer generations consistently outperforming their predecessors — improvements not attributable to parameter count alone. Furthermore, the analysis uncovers fundamental limitations in current reward models' ability to evaluate memory-related capabilities, thereby offering critical insights and directions for future research.

📝 Abstract
Existing works increasingly adopt memory-centric mechanisms to process long contexts segment by segment, and effective memory management is one of the key capabilities that enables large language models to propagate information across the entire sequence. Therefore, leveraging reward models (RMs) to automatically and reliably evaluate memory quality is critical. In this work, we introduce MemoryRewardBench, the first benchmark to systematically study the ability of RMs to evaluate long-term memory management processes. MemoryRewardBench covers both long-context comprehension and long-form generation tasks, featuring 10 distinct settings with different memory management patterns and context lengths ranging from 8K to 128K tokens. Evaluations of 13 cutting-edge RMs indicate a diminishing performance gap between open-source and proprietary models, with newer-generation models consistently outperforming their predecessors regardless of parameter count. We further expose the capabilities and fundamental limitations of current RMs in evaluating LLM memory management across diverse settings.
Problem

Research questions and friction points this paper is trying to address.

reward models
long-term memory management
large language models
long-context comprehension
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

reward models
long-term memory management
benchmark
large language models
long-context evaluation
Zecheng Tang
Soochow University, China; LCM Laboratory
Baibei Ji
Soochow University, China; LCM Laboratory
Ruoxi Sun
Soochow University, China; LCM Laboratory
Haitian Wang
University of Western Australia
3D point cloud, Computer vision, Machine learning, IoT, Remote sensing
Wangjie You
Soochow University, China
Yijun Zhang
China Mobile (Suzhou), China
Wenpeng Zhu
China Mobile (Suzhou), China
Ji Qi
China Mobile (Suzhou), China
Juntao Li
Soochow University
Language Models, Text Generation
Min Zhang
Professor of Computer Science, Soochow University
Statistical Machine Translation, Natural Language Processing and Computational Linguistics, Intelligent Computing, Machine Learning