AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

📅 2026-02-26
🤖 AI Summary
Existing benchmarks for agent memory focus largely on human-agent dialogue and do not capture the machine-generated, continuous agent-environment interactions found in real deployments. To address this gap, the work proposes AMA-Bench, a long-horizon memory benchmark for agentic applications that pairs real-world trajectories with expert-curated QA and synthetic trajectories of arbitrary length with rule-based QA. The authors also introduce AMA-Agent, a memory system that combines a causality graph with tool-augmented retrieval. Their analysis attributes the weak performance of existing memory systems to missing causality and objective information and to lossy similarity-based retrieval. On AMA-Bench, AMA-Agent achieves an average accuracy of 57.22%, surpassing the strongest memory-system baseline by 11.16%.

📝 Abstract
Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where enabling long-horizon memory is critical for achieving strong performance. However, a significant gap exists between practical applications and current evaluation standards for agent memory: existing benchmarks primarily focus on dialogue-centric, human-agent interactions. In reality, agent memory consists of a continuous stream of agent-environment interactions that are primarily composed of machine-generated representations. To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications. It features two key components: (1) a set of real-world agentic trajectories across representative agentic applications, paired with expert-curated QA, and (2) a set of synthetic agentic trajectories that scale to arbitrary horizons, paired with rule-based QA. Our comprehensive study shows that existing memory systems underperform on AMA-Bench primarily because they lack causality and objective information and are constrained by the lossy nature of similarity-based retrieval employed by many memory systems. To address these limitations, we propose AMA-Agent, an effective memory system featuring a causality graph and tool-augmented retrieval. Our results demonstrate that AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest memory system baselines by 11.16%.
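
To make the second component more concrete, here is a minimal sketch of how a synthetic trajectory of arbitrary horizon could be paired with rule-based QA. It is an illustration only and not the AMA-Bench generator: the toy item-moving environment and the names `Step`, `make_trajectory`, and `rule_based_qa` are all assumptions. The point it shows is that when the simulator state is known, ground-truth answers can be derived by rule rather than by human annotation.

```python
import random
from dataclasses import dataclass


@dataclass
class Step:
    index: int
    action: str       # machine-generated action string
    observation: str  # machine-generated environment feedback


def make_trajectory(horizon, seed=0):
    """Simulate a toy agent moving items between rooms for `horizon` steps.

    Hypothetical environment, chosen only because its state is easy to track.
    Returns the list of steps and the final item -> room mapping.
    """
    rng = random.Random(seed)
    rooms = ["kitchen", "hall", "lab"]
    items = ["key", "cup", "book"]
    location = {item: rng.choice(rooms) for item in items}
    steps = []
    for i in range(horizon):
        item = rng.choice(items)
        dest = rng.choice(rooms)
        src = location[item]
        location[item] = dest
        steps.append(
            Step(i, f"move {item} to {dest}", f"{item} moved from {src} to {dest}")
        )
    return steps, location


def rule_based_qa(final_location):
    """Derive (question, answer) pairs directly from the simulator state."""
    return [
        (f"Where is the {item} at the end of the trajectory?", room)
        for item, room in final_location.items()
    ]


if __name__ == "__main__":
    steps, final_location = make_trajectory(horizon=1000, seed=42)
    for question, answer in rule_based_qa(final_location):
        print(question, "->", answer)
```

Because `horizon` is just a loop bound, the same generator yields trajectories of any length, and the answers stay exact regardless of how long the trajectory grows.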
Problem

Research questions and friction points this paper is trying to address.

long-horizon memory
agentic applications
agent-environment interactions
memory evaluation
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-horizon memory
agentic applications
causality graph
tool-augmented retrieval
memory benchmark