AI Summary
Existing long-term memory benchmarks predominantly rely on multi-turn dialogues or synthetic user histories, which inadequately capture a model's capacity for deep human understanding. This work proposes the first public benchmark grounded in long-form autobiographical narratives, integrating dense evidence from behaviors, contextual details, and internal mental states to construct a temporally anchored, flashback-aware evaluation pipeline. The benchmark features question-answering tasks that require cross-temporal evidence integration, moving beyond conventional reliance on retrieval accuracy alone. It introduces novel evaluation mechanisms centered on narrative reconstruction, evidence linking, and retrieval-augmented reasoning. Experimental results demonstrate that while current retrieval-augmented systems perform well on factual recall, they exhibit significant limitations in temporal reasoning and in higher-order attribution of psychological states.
Abstract
Existing long-horizon memory benchmarks mostly use multi-turn dialogues or synthetic user histories, making retrieval performance an imperfect proxy for person understanding. We present \BenchName, a publicly releasable benchmark built from long-form autobiographical narratives, in which actions, context, and inner thoughts provide dense evidence for inferring stable motivations and decision principles. \BenchName~reconstructs each narrative into a flashback-aware, time-anchored stream and evaluates models with evidence-linked questions spanning factual recall, subjective state attribution, and principle-level reasoning. Across diverse narrative sources, retrieval-augmented systems mainly improve factual accuracy, while errors persist on temporally grounded explanations and higher-level inferences, highlighting the need for memory mechanisms beyond retrieval. Our data is available at \href{https://github.com/QuantaAlpha/KnowMeBench}{KnowMeBench}.