ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing dialogue systems struggle to memorize and use fragmented, implicit, and dynamically evolving user information over extended interactions, which leads to weak personalization and hallucination. This work proposes ES-MemEval, the first evaluation framework tailored to long-term memory in emotionally supportive dialogue, covering five dimensions: information extraction, temporal reasoning, conflict detection, abstention (refusal), and user modeling. Alongside this framework, the authors introduce EvoEmo, a multi-turn conversational dataset capturing evolving user states. Leveraging long-context language models and retrieval-augmented generation (RAG), they conduct systematic multi-task evaluations of memory mechanisms. Experimental results demonstrate that explicit long-term memory significantly reduces hallucinations and enhances personalization, while RAG improves factual consistency but remains limited in modeling the dynamic evolution of user states.
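The RAG setup the summary describes retrieves relevant past-session utterances as memory context before the model responds. As a minimal illustration, the sketch below ranks dialogue history by lexical (bag-of-words cosine) similarity to the current query; this is a toy stand-in for the dense retrievers typically used in such pipelines, and the `retrieve` helper and example sessions are hypothetical, not from the paper.

```python
from collections import Counter
import math


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(history: list[str], query: str, k: int = 2) -> list[str]:
    # Rank past utterances by similarity to the query and return the
    # top-k as memory context to prepend to the responder's prompt.
    qv = Counter(query.lower().split())
    scored = sorted(
        history,
        key=lambda u: cosine(Counter(u.lower().split()), qv),
        reverse=True,
    )
    return scored[:k]


# Hypothetical multi-session history with an evolving user state.
history = [
    "Session 1: I just started a new job and feel anxious.",
    "Session 2: My sister visited last weekend.",
    "Session 3: The anxiety at my new job is getting worse.",
]
print(retrieve(history, "How is the user's job anxiety evolving?", k=2))
```

Note the limitation the paper highlights: similarity retrieval surfaces both anxiety-related sessions but carries no notion of temporal order or state change, so tracking how the user's condition evolves is left entirely to the downstream model.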

📝 Abstract
Large Language Models (LLMs) have shown strong potential as conversational agents, yet their effectiveness remains limited by deficiencies in robust long-term memory, particularly in complex, long-term web-based services such as online emotional support. Moreover, existing long-term dialogue benchmarks primarily focus on static, explicit fact retrieval, failing to evaluate agents in critical scenarios where user information is dispersed, implicit, and continuously evolving. To address this gap, we introduce ES-MemEval, a comprehensive benchmark that systematically evaluates five core memory capabilities (information extraction, temporal reasoning, conflict detection, abstention, and user modeling) in long-term emotional support settings, covering question answering, summarization, and dialogue generation tasks. To support the benchmark, we also propose EvoEmo, a multi-session dataset for personalized long-term emotional support that captures fragmented, implicit user disclosures and evolving user states. Extensive experiments on open-source long-context, commercial, and retrieval-augmented (RAG) LLMs show that explicit long-term memory is essential for reducing hallucinations and enabling effective personalization. At the same time, RAG improves factual consistency but struggles with temporal dynamics and evolving user states. These findings highlight both the potential and the limitations of current paradigms and motivate tighter integration of memory and retrieval in long-term personalized dialogue systems.
Problem

Research questions and friction points this paper is trying to address.

long-term memory
emotional support
conversational agents
personalization
dialogue benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-term memory
emotional support
conversational agents
benchmark
retrieval-augmented generation