Evaluating Memory Capability in Continuous Lifelog Scenario

📅 2026-04-13

🤖 AI Summary

Existing memory system benchmarks do not match the demands that continuous lifelogging on wearable devices places on memory. To address this gap, this work proposes LifeDialBench, a benchmark for evaluating memory over continuous conversation, comprising two subsets: EgoMem, grounded in real first-person videos, and LifeMem, derived from a simulated virtual community. The data is built with a hierarchical synthesis framework, and the benchmark introduces a time-causal online evaluation protocol that prevents temporal leakage by evaluating systems in a streaming fashion. Experiments show that current sophisticated memory systems fail to outperform a simple retrieval-augmented generation (RAG) baseline, which the authors attribute to over-designed structures and lossy compression, underscoring the importance of high-fidelity context preservation.

📝 Abstract

Wearable devices can now continuously lifelog ambient conversations, creating substantial opportunities for memory systems. However, existing benchmarks primarily focus on online one-on-one chatting or human-AI interactions, and thus neglect the unique demands of real-world scenarios. Given the scarcity of public lifelogging audio datasets, we propose a hierarchical synthesis framework to curate LifeDialBench, a novel benchmark comprising two complementary subsets: EgoMem, built on real-world egocentric videos, and LifeMem, constructed using a simulated virtual community. Crucially, to address the issue of temporal leakage in traditional offline settings, we propose an Online Evaluation protocol that strictly adheres to temporal causality, ensuring systems are evaluated in a realistic streaming fashion. Our experimental results reveal a counterintuitive finding: current sophisticated memory systems fail to outperform a simple RAG-based baseline. This highlights the detrimental impact of over-designed structures and lossy compression in current approaches, emphasizing the necessity of high-fidelity context preservation for lifelog scenarios. We release our code and data at https://github.com/qys77714/LifeDialBench.
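
The idea of an online, time-causal evaluation can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Utterance`, `Question`, `SimpleRetrievalMemory`, and `online_evaluate` are hypothetical, and a plain word-overlap retriever stands in for a real RAG pipeline. The sketch only shows the causality constraint, namely that a question asked at time t may draw on utterances logged at or before t and nothing later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    time: float      # when the utterance was logged
    speaker: str
    text: str


@dataclass(frozen=True)
class Question:
    time: float      # when the question is asked
    text: str


class SimpleRetrievalMemory:
    """Hypothetical baseline: keeps raw utterances verbatim (no compression)
    and retrieves by word overlap with the query."""

    def __init__(self):
        self.store = []

    def ingest(self, utterance):
        self.store.append(utterance)

    def retrieve(self, query, k=2):
        query_words = set(query.lower().split())
        scored = []
        for utt in self.store:
            overlap = len(query_words & set(utt.text.lower().split()))
            if overlap > 0:
                scored.append((overlap, utt))
        # Highest overlap first; ties broken by most recent utterance.
        scored.sort(key=lambda pair: (-pair[0], -pair[1].time))
        return [utt for _, utt in scored[:k]]


def online_evaluate(utterances, questions, memory, k=2):
    """Replay utterances and questions as one time-ordered stream.

    The memory has only ingested utterances with time <= question.time
    when each question is answered, so no future context can leak in.
    """
    # Kind 0 (utterance) sorts before kind 1 (question) at equal timestamps.
    events = [(u.time, 0, u) for u in utterances]
    events += [(q.time, 1, q) for q in questions]
    events.sort(key=lambda event: (event[0], event[1]))

    results = []
    for _, kind, item in events:
        if kind == 0:
            memory.ingest(item)
        else:
            results.append((item, memory.retrieve(item.text, k)))
    return results


if __name__ == "__main__":
    log = [
        Utterance(1.0, "Alice", "the meeting moved to Friday"),
        Utterance(5.0, "Bob", "the meeting moved to Monday instead"),
    ]
    asked = [Question(3.0, "when is the meeting"), Question(6.0, "when is the meeting")]
    for question, context in online_evaluate(log, asked, SimpleRetrievalMemory()):
        print(question.time, [u.text for u in context])
```

In an offline setting the whole log would be ingested before any question is asked, so the question at time 3.0 could retrieve the later correction from time 5.0. Replaying the stream in order rules that out.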

Problem

Research questions and friction points this paper is trying to address.

lifelog

memory systems

continuous conversation

temporal causality

evaluation benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

lifelog memory

online evaluation

temporal causality

LifeDialBench

RAG baseline
