🤖 AI Summary
Existing memory-system benchmarks do not match the demands of wearable devices in continuous lifelogging scenarios. To address this gap, this work proposes LifeDialBench, a novel benchmark for evaluating continuous conversational memory. It comprises two subsets: EgoMem, grounded in real first-person videos, and LifeMem, derived from a simulated virtual community. The dataset is built with a hierarchical synthesis framework and supports streaming inputs, and a time-causal online evaluation protocol strictly prevents temporal leakage. Experiments show that current state-of-the-art memory systems fail to outperform a simple retrieval-augmented generation (RAG) baseline, which suggests that over-engineered structures and lossy compression compromise contextual fidelity and underscores the importance of high-fidelity long-term memory retention.
📝 Abstract
Wearable devices can now continuously lifelog ambient conversations, creating substantial opportunities for memory systems. However, existing benchmarks primarily focus on online one-on-one chat or human-AI interaction and neglect the distinct demands of real-world lifelogging. Given the scarcity of public lifelogging audio datasets, we propose a hierarchical synthesis framework to curate LifeDialBench, a novel benchmark comprising two complementary subsets: EgoMem, built on real-world egocentric videos, and LifeMem, constructed from a simulated virtual community. Crucially, to address temporal leakage in traditional offline settings, we propose an Online Evaluation protocol that strictly adheres to temporal causality, ensuring that systems are evaluated in a realistic streaming fashion. Our experiments reveal a counterintuitive finding: current sophisticated memory systems fail to outperform a simple RAG-based baseline. This highlights the detrimental impact of over-designed structures and lossy compression in current approaches and emphasizes the need for high-fidelity context preservation in lifelogging scenarios. We release our code and data at https://github.com/qys77714/LifeDialBench.
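To make the idea of time-causal evaluation concrete, the following is a minimal sketch of how a streaming evaluation loop might avoid temporal leakage. It is not the released LifeDialBench code: the names `Utterance`, `Question`, `KeywordMemory`, and `online_evaluate` are hypothetical, and the word-overlap retriever is only a toy stand-in for a real memory system or RAG baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    time: float  # when the utterance was recorded (hypothetical field)
    text: str


@dataclass(frozen=True)
class Question:
    time: float  # when the question is asked (hypothetical field)
    text: str


class KeywordMemory:
    """Toy stand-in for a memory system: stores raw text, retrieves by word overlap."""

    def __init__(self):
        self.store = []

    def ingest(self, utterance):
        self.store.append(utterance)

    def retrieve(self, query, k=1):
        query_words = set(query.lower().split())
        # Rank stored utterances by number of shared words (ties keep insertion order).
        ranked = sorted(
            self.store,
            key=lambda u: len(query_words & set(u.text.lower().split())),
            reverse=True,
        )
        return ranked[:k]


def online_evaluate(utterances, questions, memory):
    """Replay the stream in time order; each question sees only earlier utterances."""
    pending = sorted(utterances, key=lambda u: u.time)
    results = []
    idx = 0
    for q in sorted(questions, key=lambda q: q.time):
        # Feed every utterance recorded at or before the question time, and nothing later.
        while idx < len(pending) and pending[idx].time <= q.time:
            memory.ingest(pending[idx])
            idx += 1
        results.append((q, memory.retrieve(q.text)))
    return results


stream = [
    Utterance(1.0, "lunch with Sam at the noodle place"),
    Utterance(5.0, "dinner with Sam at the pizza place"),
]
questions = [Question(2.0, "where did I eat with Sam")]
results = online_evaluate(stream, questions, KeywordMemory())
```

In this sketch the question asked at time 2.0 can only retrieve the lunch utterance, because the dinner utterance at time 5.0 has not yet been ingested. An offline setup that indexes the whole stream before answering would expose both, which is the kind of leakage the protocol is meant to rule out.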