According to Me: Long-Term Personalized Referential Memory QA

📅 2026-03-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing long-term memory benchmarks focus primarily on dialogue history and do not cover multimodal, multi-source personalized referential memory question answering, so they miss the referencing and reasoning demands of real-world use. To address this gap, this work proposes ATM-Bench, the first benchmark designed for multimodal, multi-source personalized referential memory QA, along with Schema-Guided Memory (SGM), an approach that uses structured memory representations to represent items from different sources. Experiments show that current systems achieve less than 20% accuracy on the challenging ATM-Bench-Hard subset, and that SGM improves performance over the descriptive memory representations commonly adopted in prior work.

📝 Abstract

Personalized AI assistants must recall and reason over long-term user memory, which naturally spans multiple modalities and sources such as images, videos, and emails. However, existing long-term memory benchmarks focus primarily on dialogue history, failing to capture realistic personalized references grounded in lived experience. We introduce ATM-Bench, the first benchmark for multimodal, multi-source personalized referential memory QA. ATM-Bench contains approximately four years of privacy-preserving personal memory data and human-annotated question-answer pairs with ground-truth memory evidence, including queries that require resolving personal references, reasoning over multiple pieces of evidence from multiple sources, and handling conflicting evidence. We propose Schema-Guided Memory (SGM) to structurally represent memory items originating from different sources. In experiments, we implement five state-of-the-art memory systems along with a standard RAG baseline and evaluate variants with different memory ingestion, retrieval, and answer generation techniques. We find poor performance (under 20% accuracy) on the ATM-Bench-Hard set, and that SGM improves performance over the Descriptive Memory commonly adopted in prior work. Code is available at: https://github.com/JingbiaoMei/ATM-Bench
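To make the contrast between structured and descriptive memory concrete, the sketch below shows one way a memory item could be stored as typed fields rather than free text. The abstract does not specify SGM's schema, so the `MemoryItem` class, its field names, and the helper functions are all hypothetical illustrations rather than the authors' implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class MemoryItem:
    """Hypothetical structured memory record (field names are assumptions)."""
    source: str                                   # e.g. "image", "video", "email"
    timestamp: datetime                           # when the memory was captured
    location: Optional[str] = None                # place, if known
    people: List[str] = field(default_factory=list)
    summary: str = ""                             # short content description


def to_descriptive(item: MemoryItem) -> str:
    """Flatten a structured item into one free-text description."""
    parts = [f"{item.source} from {item.timestamp:%Y-%m-%d}"]
    if item.location:
        parts.append(f"at {item.location}")
    if item.people:
        parts.append("with " + ", ".join(item.people))
    return " ".join(parts) + f": {item.summary}"


def filter_items(items, source=None, person=None, year=None):
    """Select items by exact field match; None means 'do not filter on this'."""
    selected = []
    for item in items:
        if source is not None and item.source != source:
            continue
        if person is not None and person not in item.people:
            continue
        if year is not None and item.timestamp.year != year:
            continue
        selected.append(item)
    return selected
```

With fields kept separate, a query such as "emails from 2024" becomes an exact filter across sources, whereas a descriptive string would have to be matched by text similarity. Whether this is how SGM gains its advantage is not stated in the abstract; the sketch only illustrates the general idea of schema-based records.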

Problem

Research questions and friction points this paper is trying to address.

personalized memory

multimodal memory

referential QA

long-term memory

memory benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal memory

personalized QA

Schema-Guided Memory