Same Ranking, Different Winner: How Scoring Targets Shape LLM Memory Benchmarks

📅 2026-05-21

🤖 AI Summary

Current evaluations of conversational memory rarely specify which stored form of a memory (raw utterances, factual statements, or canonical summaries) should receive retrieval credit, and this omission can produce inconsistent or even reversed conclusions. This work proposes TIAP, a fixed-output audit that rescores saved retrieval rankings against three scoring targets (Raw, Source, and Canonical) without rerunning retrieval. It identifies and quantifies "target noninvariance" in memory evaluation. On LoCoMo and LongMemEval-S, changing only the scoring target changes nDCG on 83.4%–94.0% of shared queries, flips target orderings on Mem0 and MemoryOS transfer runs, and reverses parser-density recommendations. A 1,902-case semantic audit further finds that relaxed source-linked credit is fully justified only 29.2% of the time.

📝 Abstract

Conversational-memory systems increasingly transform dialogue history into facts, summaries, timelines, and other source-linked descendants, so a single source turn can coexist with several derived memories in the same retrieval index. This raises an underspecified evaluation question: which stored form should receive retrieval credit? We show that this scoring-target choice is often left implicit and can materially change benchmark conclusions. We present TIAP, a fixed-output audit that rescores saved ranked outputs under three targets (Raw, Source, and Canonical) without rerunning retrieval. On LoCoMo and LongMemEval-S, switching only the credited target changes nDCG on 83.4 to 94.0 percent of shared queries, flips target orderings on Mem0 and MemoryOS transfer runs, and reverses parser-density recommendations. A 1,902-case semantic audit further shows that relaxed source-linked credit is fully justified only 29.2 percent of the time, despite high rubric reliability in a validation subset. These results reveal target noninvariance: conclusions about memory architectures can silently flip with a single benchmark-design choice. Conversational-memory papers should therefore define and report the scoring target explicitly.
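
To make the idea of fixed-output rescoring concrete, here is a minimal sketch, not the authors' implementation. The abstract does not define the three targets precisely, so the crediting rules below are assumptions: Raw credits only the original gold turn, Source credits any stored item linked to a gold turn, and Canonical credits only a linked item flagged as canonical. The `Memory` class, the `canonical` flag, and the toy rankings are all hypothetical. The ideal DCG is also computed from the saved ranking alone, which is a simplification.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Memory:
    kind: str                 # "raw" for an original turn, else a derived form
    source_turn: str          # dialogue turn this item is, or was derived from
    canonical: bool = False   # hypothetical flag marking the canonical form

def credited(item, gold_turns, target):
    """Assumed crediting rules; the paper's exact definitions may differ."""
    linked = item.source_turn in gold_turns
    if target == "raw":
        return linked and item.kind == "raw"
    if target == "source":
        return linked
    if target == "canonical":
        return linked and item.canonical
    raise ValueError(f"unknown target: {target}")

def ndcg(ranking, gold_turns, target, k=10):
    """Binary-gain nDCG@k over a saved ranking; nothing is retrieved again."""
    gains = [1.0 if credited(m, gold_turns, target) else 0.0 for m in ranking]

    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values[:k]))

    # Simplification: the ideal ordering is taken over the saved ranking only.
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

def rescore(ranking, gold_turns):
    """Score the same saved output under all three assumed targets."""
    return {t: ndcg(ranking, gold_turns, t) for t in ("raw", "source", "canonical")}

# Toy saved outputs from two hypothetical systems for one query.
gold = {"t1"}
system_a = [Memory("raw", "t1"), Memory("fact", "t9")]
system_b = [Memory("summary", "t1", canonical=True), Memory("raw", "t1")]

scores_a = rescore(system_a, gold)
scores_b = rescore(system_b, gold)
for target in ("raw", "source", "canonical"):
    print(target, round(scores_a[target], 3), round(scores_b[target], 3))
```

In this toy case the two rankings never change, yet system A scores higher under the Raw target and system B scores higher under the Canonical target. That is the kind of flip the abstract calls target noninvariance. The numbers are illustrative only and are not results from the paper.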

Problem

Research questions and friction points this paper is trying to address.

conversational memory

retrieval evaluation

scoring target

benchmark design

target noninvariance

Innovation

Methods, ideas, or system contributions that make the work stand out.

scoring target

memory benchmark

target noninvariance