🤖 AI Summary
Existing LLM memory-evaluation benchmarks suffer from static task design, susceptibility to overfitting, poor interpretability, and limited diagnostic capability, making it hard to pinpoint a model's specific failure modes. To address these limitations, we propose MemBench, the first programmable memory benchmark, featuring procedurally generated, structured tasks that cover atomic memory operations (search, recall, editing, matching, and comparison) as well as multi-step composite scenarios that require maintaining state. Methodologically, MemBench combines structured context chunking, decoupled modeling of atomic capabilities, and composable task orchestration. This framework enables fine-grained, attributable, and reproducible memory assessment, substantially improving test coverage and diagnostic precision. By providing modular, interpretable, and debuggable evaluation, MemBench establishes a new paradigm for the systematic investigation of LLM memory mechanisms.
📝 Abstract
How effectively can LLM-based AI assistants use their memory (context) to perform various tasks? Traditional benchmarks, which are often manually crafted, suffer from several limitations: they are static, susceptible to overfitting, difficult to interpret, and short on actionable insight, failing to pinpoint the specific capabilities a model lacks when it fails a test. In this paper, we present a framework for automatically generating a comprehensive set of tests that evaluate how effectively models use their memory. Our framework extends capability testing beyond the search tasks that dominate the literature (passkey retrieval, key-value lookup, needle in a haystack). Specifically, we evaluate models on atomic tasks such as searching, recalling, editing, matching, and comparing information in context memory, as well as performing basic operations when inputs are structured into distinct blocks, simulating real-world data. Additionally, we design composite tests that probe models' ability to maintain state while operating on memory. Our benchmark enables an interpretable, detailed assessment of the memory capabilities of LLMs.
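To make the idea of procedural generation concrete, here is a minimal sketch of how one atomic task type (key-value search over structured blocks) could be generated. All names (`make_block`, `make_search_task`) and the exact task format are illustrative assumptions, not the paper's actual implementation; the point is that each task instance is seeded, reproducible, and carries its own ground-truth answer, so failures can be attributed to a specific atomic capability.

```python
import random
import string

# Hypothetical sketch of seeded, procedural task generation in the spirit
# of the framework described above. Names and formats are assumptions.

def make_block(block_id, n_pairs, rng):
    """Generate one structured context block of random key-value pairs."""
    pairs = {
        "".join(rng.choices(string.ascii_lowercase, k=6)): rng.randint(0, 9999)
        for _ in range(n_pairs)
    }
    return {"id": block_id, "pairs": pairs}

def make_search_task(n_blocks=4, pairs_per_block=5, seed=0):
    """Build an atomic 'search' task: one queried key is hidden among
    distractor blocks, and the ground-truth answer is recorded."""
    rng = random.Random(seed)
    blocks = [make_block(i, pairs_per_block, rng) for i in range(n_blocks)]
    target_block = rng.choice(blocks)
    target_key = rng.choice(list(target_block["pairs"]))
    # Render blocks as distinct, labeled chunks to simulate structured input.
    context = "\n\n".join(
        f"[Block {b['id']}]\n"
        + "\n".join(f"{k} = {v}" for k, v in b["pairs"].items())
        for b in blocks
    )
    question = f"What value is stored under the key '{target_key}'?"
    answer = target_block["pairs"][target_key]
    return {"context": context, "question": question, "answer": answer}

task = make_search_task(seed=42)
```

Because every instance is determined by its seed, the same test can be regenerated exactly for debugging, and fresh instances can be drawn at scale to resist overfitting. Composite tasks could then be assembled by chaining such atomic generators against a shared, mutating context.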