EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of existing large language model evaluation benchmarks, which are predominantly confined to single-turn dialogue scenarios and thus inadequate for assessing memory capabilities in complex, long-term, multi-character interactions. To this end, the authors propose EverMemBench—the first benchmark supporting multi-character, cross-topic, temporally evolving long-context dialogues—comprising over one million tokens of conversational data and more than 1,000 structured question-answer pairs. The benchmark introduces a three-dimensional evaluation framework encompassing fine-grained recall, memory awareness, and user profile comprehension. Experimental results reveal significant bottlenecks in current models regarding multi-hop reasoning, temporal semantic modeling, and implicit memory retrieval: even under ideal conditions, model accuracy on multi-character, multi-hop tasks remains as low as 26%, and conventional similarity-based retrieval methods fail to bridge the semantic gap between queries and implicit memories.

📝 Abstract
Long-term conversational memory is essential for LLM-based assistants, yet existing benchmarks focus on dyadic, single-topic dialogues that fail to capture real-world complexity. We introduce EverMemBench, a benchmark featuring multi-party, multi-group conversations spanning over 1 million tokens with temporally evolving information, cross-topic interleaving, and role-specific personas. EverMemBench evaluates memory systems across three dimensions through 1,000+ QA pairs: fine-grained recall, memory awareness, and user profile understanding. Our evaluation reveals critical limitations: (1) multi-hop reasoning collapses in multi-party settings, with even oracle models achieving only 26%; (2) temporal reasoning remains unsolved, requiring version semantics beyond timestamp matching; (3) memory awareness is bottlenecked by retrieval, where current similarity-based methods fail to bridge the semantic gap between queries and implicitly relevant memories. EverMemBench provides a challenging testbed for developing next-generation memory architectures.
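The retrieval bottleneck in finding (3) can be illustrated with a toy sketch. Everything below is hypothetical: the dialogue snippets are invented, and a bag-of-words counter stands in for the dense embedding models actual memory systems use; this is not the paper's setup, only a minimal demonstration of how surface similarity misses an implicitly relevant memory.

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": token counts (a stand-in for a dense encoder).
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, memories, k=1):
    # Rank stored memories by surface similarity to the query.
    q = embed(query)
    return sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)[:k]

# Hypothetical memory store: the second entry is the one *implicitly*
# relevant to the query, but it shares almost no vocabulary with it.
memories = [
    "Alice said she loves hiking on weekends",
    "Bob mentioned he is allergic to peanuts",
    "Carol talked about her new job at the bank",
]
query = "What food should we avoid serving Bob at the dinner party?"

# Top-1 retrieval returns Carol's unrelated memory (it overlaps on
# "at"/"the"), not the peanut-allergy fact the query actually needs.
print(retrieve(query, memories))
```

The query never mentions allergies or peanuts, so similarity scoring ranks an irrelevant memory above the one a human would recall; this is the query-memory semantic gap the abstract attributes to current similarity-based retrieval, reproduced here in miniature.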
Problem

Research questions and friction points this paper is trying to address.

long-term memory
interactive memory
multi-party conversation
temporal reasoning
memory benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-term memory
multi-party dialogue
temporal reasoning
memory awareness
benchmark