Convomem Benchmark: Why Your First 150 Conversations Don't Need RAG

πŸ“… 2025-11-13
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing benchmarks for conversational memory suffer from limited statistical power, inconsistent data generation, and inflexible evaluation, and they overlook a defining property of conversational memory: it starts from zero and grows with each conversation. This paper introduces a large-scale conversational memory benchmark of 75,336 question-answer pairs spanning diverse categories, including user facts, assistant recall, abstention, preferences, temporal changes, and implicit connections. While memory systems share core architectural patterns with retrieval-augmented generation (RAG), such as temporal reasoning, knowledge updates, and graph representations, their progressive growth makes naive approaches practical: on conversation histories under 150 interactions, simple full-context approaches achieve 70-82% accuracy even on the hardest multi-message evidence cases, while sophisticated RAG-based memory systems such as Mem0 achieve only 30-45%. The analysis delineates practical transition points: full context excels for the first 30 conversations, remains viable with manageable trade-offs up to about 150, and should give way to hybrid or RAG approaches beyond that as cost and latency become prohibitive.

πŸ“ Abstract
We introduce a comprehensive benchmark for conversational memory evaluation containing 75,336 question-answer pairs across diverse categories including user facts, assistant recall, abstention, preferences, temporal changes, and implicit connections. While existing benchmarks have advanced the field, our work addresses fundamental challenges in statistical power, data generation consistency, and evaluation flexibility that limit current memory evaluation frameworks. We examine the relationship between conversational memory and retrieval-augmented generation (RAG). While these systems share fundamental architectural patterns--temporal reasoning, implicit extraction, knowledge updates, and graph representations--memory systems have a unique characteristic: they start from zero and grow progressively with each conversation. This characteristic enables naive approaches that would be impractical for traditional RAG. Consistent with recent findings on long context effectiveness, we observe that simple full-context approaches achieve 70-82% accuracy even on our most challenging multi-message evidence cases, while sophisticated RAG-based memory systems like Mem0 achieve only 30-45% when operating on conversation histories under 150 interactions. Our analysis reveals practical transition points: long context excels for the first 30 conversations, remains viable with manageable trade-offs up to 150 conversations, and typically requires hybrid or RAG approaches beyond that point as costs and latencies become prohibitive. These patterns indicate that the small-corpus advantage of conversational memory--where exhaustive search and complete reranking are feasible--deserves dedicated research attention rather than simply applying general RAG solutions to conversation histories.
Problem

Research questions and friction points this paper is trying to address.

Evaluation of conversational memory systems is limited by weak statistical power and inconsistent data generation
Determining when retrieval-augmented generation becomes necessary versus simple approaches
Identifying practical transition points between memory strategies as conversations grow
Innovation

Methods, ideas, or system contributions that make the work stand out.

Full-context approaches outperform RAG in early conversations
Progressive memory growth enables simple yet effective solutions
Hybrid or RAG approaches become necessary beyond the roughly 150-conversation threshold
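
The transition points above amount to a simple size-based policy: exhaustively concatenate the whole history while the corpus is small, and switch strategies as it grows. A minimal sketch, assuming illustrative names (`ConversationMemory`, `build_prompt`) and the thresholds reported in the paper; this is not the paper's implementation:

```python
# Thresholds from the paper's analysis of full-context vs RAG trade-offs.
FULL_CONTEXT_SWEET_SPOT = 30     # long context clearly wins
FULL_CONTEXT_VIABLE_LIMIT = 150  # beyond this, hybrid/RAG is advised


class ConversationMemory:
    """Toy memory store that grows from zero, one entry per conversation."""

    def __init__(self):
        self.conversations = []

    def add(self, transcript: str) -> None:
        self.conversations.append(transcript)

    def strategy(self) -> str:
        """Pick a memory strategy from corpus size alone."""
        n = len(self.conversations)
        if n <= FULL_CONTEXT_SWEET_SPOT:
            return "full-context"
        if n <= FULL_CONTEXT_VIABLE_LIMIT:
            return "full-context (watch cost/latency)"
        return "hybrid/RAG"

    def build_prompt(self, question: str) -> str:
        """Naive full-context prompt: exhaustive 'search' over everything."""
        history = "\n\n".join(self.conversations)
        return f"{history}\n\nQuestion: {question}"


memory = ConversationMemory()
memory.add("User: I adopted a cat named Miso.")
memory.add("User: Correction - Miso turned out to be a dog.")
print(memory.strategy())   # few conversations, so full context is fine
print(memory.build_prompt("What pet does the user have?"))
```

The point of the sketch is that at small corpus sizes the "retrieval" step disappears entirely: every stored conversation fits in the prompt, so exhaustive inclusion replaces search and reranking.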