🤖 AI Summary
Existing agent memory benchmarks do not directly assess whether a memory system internalizes environment-specific experience in customized web environments. To address this gap, this work proposes LongMemEval-V2 (LME-V2), a benchmark of 451 manually curated questions paired with histories of up to 500 trajectories and 115M tokens. It targets five core memory abilities: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. The work also introduces two AgentRunbook memory methods: AgentRunbook-R, a RAG-based memory with knowledge pools for raw state observations, events, and strategy notes, and AgentRunbook-C, which stores trajectories as files and invokes a coding agent to gather evidence in an augmented sandbox. In the experiments, AgentRunbook-C achieves the best average accuracy at 72.5%, ahead of the strongest RAG baseline (48.5%) and the off-the-shelf coding agent baseline (69.3%). Coding-agent-based methods incur high latency, however, and although AgentRunbook-C advances the accuracy-latency Pareto frontier, substantial room for improvement remains.
📝 Abstract
Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task success, leaving open the question of how to directly evaluate whether memory systems internalize environment-specific experience. To address this gap, we introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments. LME-V2 contains 451 manually curated questions covering five core memory abilities for web agents: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. Each question is paired with a history of up to 500 trajectories and 115M tokens. We use a context gathering formulation: a memory system consumes the history trajectories and returns compact evidence for downstream question answering. We propose two memory methods: AgentRunbook-R, an efficient RAG-based memory with knowledge pools for raw state observations, events, and strategy notes, and AgentRunbook-C, which stores trajectories as files and invokes a coding agent to gather evidence in an augmented sandbox. Experiments show that AgentRunbook-C achieves the best performance with 72.5% average accuracy, outperforming the strongest RAG baseline (48.5%) and the off-the-shelf coding agent baseline (69.3%). Despite these accuracy gains, coding-agent-based methods incur high latency. While AgentRunbook-C advances the accuracy-latency Pareto frontier, substantial room for improvement remains. Together, these results establish LME-V2 as a challenging testbed for developing long-term memory systems for environment experience.
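The context gathering formulation described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Trajectory` schema, the `MemorySystem` interface, the `KeywordMemory` class, and the `answer_with_memory` helper are hypothetical names, and the keyword-overlap ranking is a toy stand-in, not AgentRunbook-R or AgentRunbook-C. The sketch only shows the shape of the contract: a memory system ingests history trajectories and returns compact evidence, and a separate reader answers from that evidence.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Trajectory:
    """One past interaction episode as plain text (hypothetical schema)."""
    trajectory_id: str
    text: str


class MemorySystem(Protocol):
    """Assumed interface for the context gathering formulation."""

    def ingest(self, history: list[Trajectory]) -> None: ...

    def gather(self, question: str, max_chars: int) -> str: ...


class KeywordMemory:
    """Toy memory: ranks trajectories by word overlap with the question."""

    def __init__(self) -> None:
        self._history: list[Trajectory] = []

    def ingest(self, history: list[Trajectory]) -> None:
        self._history.extend(history)

    def gather(self, question: str, max_chars: int = 200) -> str:
        query = set(question.lower().split())
        # Highest overlap first; ties keep their ingestion order.
        ranked = sorted(
            self._history,
            key=lambda t: len(query & set(t.text.lower().split())),
            reverse=True,
        )
        evidence = "\n".join(f"[{t.trajectory_id}] {t.text}" for t in ranked)
        # "Compact" is modeled here as a simple character budget.
        return evidence[:max_chars]


def answer_with_memory(
    memory: MemorySystem,
    question: str,
    reader: Callable[[str, str], str],
) -> str:
    """Gather evidence first, then let a downstream reader answer from it."""
    evidence = memory.gather(question, max_chars=200)
    return reader(question, evidence)
```

In this sketch the reader is any callable taking the question and the gathered evidence; in a real evaluation it would presumably be a question-answering model, which is why the memory system is judged on the evidence it returns rather than on completing the task itself.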