🤖 AI Summary
Existing memory evaluation benchmarks are confined to conversational settings and fail to capture the demands of enterprise-scale agent collaboration: multi-platform operation, asynchrony, information conflicts, and cross-dependency management. This paper introduces the first benchmark for long-term memory and state tracking in dynamic, multi-platform agent environments, simulating realistic software development workflows that integrate Slack, Linear, and Git. It addresses core challenges including conflict resolution, cross-platform state consistency, and code-aware reasoning. The paper constructs a high-fidelity, scalable synthetic dataset by combining expert curation with LLM-generated artifacts, and proposes a three-dimensional evaluation framework (Correctness, Efficiency, and Redundancy) to holistically assess memory fidelity and reasoning quality. Empirical results show that even state-of-the-art LLMs (e.g., GPT-5) achieve only 60% Correctness, exposing fundamental limitations in long-horizon memory retention, cross-platform dependency modeling, and reasoning over contradictory information.
📝 Abstract
Recent work on context and memory benchmarking has primarily focused on conversational settings, but evaluating memory in dynamic enterprise environments is crucial for its effective application. We introduce MEMTRACK, a benchmark designed to evaluate long-term memory and state tracking in multi-platform agent environments. MEMTRACK models realistic organizational workflows by integrating asynchronous events across multiple communication and productivity platforms such as Slack, Linear, and Git. Each benchmark instance provides a chronologically interleaved, cross-platform timeline containing noisy, conflicting, and cross-referring information, and may require codebase/file-system comprehension and exploration. Consequently, our benchmark tests memory capabilities such as acquisition, selection, and conflict resolution. We curate the MEMTRACK dataset through both manual expert-driven design and scalable agent-based synthesis, generating ecologically valid scenarios grounded in real-world software development processes. We introduce pertinent metrics for Correctness, Efficiency, and Redundancy that capture the effectiveness of memory mechanisms beyond simple QA performance. Experiments across SoTA LLMs and memory backends reveal challenges in utilizing memory across long horizons, handling cross-platform dependencies, and resolving contradictions. Notably, the best-performing GPT-5 model achieves only a 60% Correctness score on MEMTRACK. This work provides an extensible framework for advancing evaluation research on memory-augmented agents beyond the existing focus on conversational setups, and sets the stage for multi-agent, multi-platform memory benchmarking in complex organizational settings.