🤖 AI Summary
This work addresses a limitation of existing benchmarks for long-term conversational memory in large language models: they focus on dyadic, single-topic dialogues and therefore cannot assess memory in complex, long-running, multi-party interactions. The authors propose EverMemBench, a benchmark of multi-party, multi-group conversations with cross-topic interleaving and temporally evolving information, comprising over one million tokens of conversational data and more than 1,000 structured question-answer pairs. It evaluates memory systems along three dimensions: fine-grained recall, memory awareness, and user profile understanding. The results reveal three bottlenecks: multi-hop reasoning collapses in multi-party settings, where even oracle models reach only 26% accuracy; temporal reasoning requires version semantics beyond timestamp matching; and memory awareness is limited by retrieval, because similarity-based methods fail to bridge the semantic gap between queries and implicitly relevant memories.
📝 Abstract
Long-term conversational memory is essential for LLM-based assistants, yet existing benchmarks focus on dyadic, single-topic dialogues that fail to capture real-world complexity. We introduce EverMemBench, a benchmark featuring multi-party, multi-group conversations spanning over 1 million tokens with temporally evolving information, cross-topic interleaving, and role-specific personas. EverMemBench evaluates memory systems across three dimensions through 1,000+ QA pairs: fine-grained recall, memory awareness, and user profile understanding. Our evaluation reveals critical limitations: (1) multi-hop reasoning collapses in multi-party settings, with even oracle models achieving only 26%; (2) temporal reasoning remains unsolved, requiring version semantics beyond timestamp matching; (3) memory awareness is bottlenecked by retrieval, where current similarity-based methods fail to bridge the semantic gap between queries and implicitly relevant memories. EverMemBench provides a challenging testbed for developing next-generation memory architectures.