When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

📅 2026-05-08

🤖 AI Summary

This study addresses a limitation of existing agent-memory evaluations: they typically rely on static snapshots and so cannot show how the usability of task-relevant evidence degrades as irrelevant sessions accumulate. The authors propose a scale-conditioned evaluation protocol that holds task-critical evidence fixed while incrementally adding irrelevant sessions, logging agent-memory trajectories at each scale. The protocol reports four diagnostics: budget-compliant reliability, tail memory-call burden, failure-regime decomposition, and the usable-scale boundary at which reliability falls below a target. Evaluations are run with Qwen3-series agents on the LongMemEval and LoCoMo benchmarks across flat, planar, and hierarchical memory interfaces. On LongMemEval, HippoRAG stays within the two-call budget yet loses 16–20 percentage points of budget-compliant reliability as irrelevant sessions are added, while LiCoMemory's failures depend strongly on the agent: Qwen3-8B exceeds the budget, whereas Qwen3-32B and Qwen3-235B remain reliable in the tested range.
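
As a rough illustration of evidence-preserving growth, the sketch below builds the memory contents for one query at a given scale. It is not the authors' code: the function name `grow_memory`, the string representation of sessions, and the seeded-shuffle policy are all assumptions made for this example. The property it tries to capture is that the evidence sessions are always present and that the irrelevant sessions at a smaller scale are a subset of those at a larger scale.

```python
import random
from typing import List, Sequence


def grow_memory(
    evidence: Sequence[str],
    distractors: Sequence[str],
    scale: int,
    seed: int = 0,
) -> List[str]:
    """Return all evidence sessions plus `scale` irrelevant sessions.

    Hypothetical helper: sessions are plain strings here, and session
    ordering (e.g. by timestamp) is deliberately not modeled.
    """
    if scale < 0 or scale > len(distractors):
        raise ValueError("scale must be between 0 and len(distractors)")
    # A fixed seed gives the same distractor order at every scale, so the
    # first `scale` items form nested sets as scale increases.
    pool = list(distractors)
    random.Random(seed).shuffle(pool)
    return list(evidence) + pool[:scale]
```

Because the evidence list is copied unchanged at every scale, any drop in accuracy between two scales can be attributed to the added irrelevant sessions rather than to missing evidence.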

📝 Abstract

Memory-agent evaluations report fixed-snapshot accuracy or retrieval quality, but these scores do not show whether evidence remains usable as irrelevant sessions (sessions not annotated as task-relevant evidence for the query) accumulate. We present a scale-conditioned evaluation protocol for agent memory under evidence-preserving growth: for each query, task evidence is held fixed while irrelevant sessions are added. The protocol logs agent-memory trajectories and reports four diagnostics: budget-compliant reliability, tail memory-call burden, failure-regime decomposition, and the usable-scale boundary where reliability falls below the target. Applied to LongMemEval and LoCoMo across flat, planar, and hierarchical memory interfaces, the protocol shows reliability loss is not a single phenomenon. On LongMemEval, HippoRAG stays within the two-call budget but loses 16–20 percentage points in budget-compliant reliability as irrelevant sessions are added; LiCoMemory's observed failures depend strongly on the agent, with Qwen3-8B exceeding the budget while Qwen3-32B and Qwen3-235B remain reliable in the tested range. The result supports a framework for making scalable-memory claims conditional on agent, interface, scale range, and interaction budget.
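
Two of the diagnostics can be sketched from logged trajectories as shown below. The abstract does not give formal definitions, so this example assumes that budget-compliant reliability at a scale is the fraction of queries answered correctly using no more than the allowed number of memory calls, and that the usable-scale boundary is the smallest tested scale at which that fraction falls below the target. The `Trial` record and both function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Trial:
    """One logged agent-memory trajectory (hypothetical schema)."""
    scale: int         # number of irrelevant sessions added
    correct: bool      # whether the final answer was judged correct
    memory_calls: int  # memory calls the agent issued


def budget_compliant_reliability(
    trials: Iterable[Trial], budget: int
) -> Dict[int, float]:
    """Per scale: share of trials that are correct AND within budget."""
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for t in trials:
        totals[t.scale] += 1
        if t.correct and t.memory_calls <= budget:
            hits[t.scale] += 1
    return {s: hits[s] / totals[s] for s in sorted(totals)}


def usable_scale_boundary(
    reliability: Dict[int, float], target: float
) -> Optional[int]:
    """Smallest tested scale with reliability below target, else None."""
    for scale in sorted(reliability):
        if reliability[scale] < target:
            return scale
    return None
```

Under these assumptions, a correct answer that needs three calls counts as a failure when the budget is two, which keeps over-budget successes separate from within-budget ones. A return value of `None` means only that no boundary was observed in the tested range, not that the system scales indefinitely.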

Problem

Research questions and friction points this paper is trying to address.

agent memory

evidence usability

scale-conditioned evaluation

memory scalability

irrelevant sessions

Innovation

Methods, ideas, or system contributions that make the work stand out.

scale-conditioned evaluation

agent memory

evidence-preserving growth