🤖 AI Summary
Existing memory evaluation methods rely on static, off-policy data and therefore struggle to assess large language models' (LLMs') long-term memory reliably and at scale in extended dialogues. This work proposes the first online, interactive memory evaluation framework, which integrates structured user profiles, state-dependent questions, and evolving state trajectories, using LLMs to role-play simulated users while preserving state consistency and generating high-quality conversations. By combining structured state evolution with free-form dialogue, the framework supports both diagnostic evaluation and self-evolving optimization of memory strategies. Through a multi-dimensional metric system, it reveals performance gaps, and their underlying causes, among mainstream approaches in long-horizon dialogues, including retrieval-augmented generation (RAG), long-context LLMs, and agent-based memory systems, offering empirical guidance for selecting and improving memory architectures.
📝 Abstract
Long-horizon interactions between users and LLM-based assistants require effective memory management, yet current approaches face challenges in both training and evaluating memory. Existing memory benchmarks rely on static, off-policy data as context, limiting the reliability and scalability of evaluation. To address these gaps, we introduce AMemGym, an interactive environment enabling on-policy evaluation and optimization for memory-driven personalization. AMemGym employs structured data sampling to predefine user profiles, state-dependent questions, and state-evolution trajectories, enabling cost-effective generation of high-quality, evaluation-aligned interactions. LLM-simulated users expose latent states through role-play while maintaining consistency with the structured states. Comprehensive metrics grounded in the structured data guide both the assessment and the optimization of assistants. Extensive experiments reveal performance gaps in existing memory systems (e.g., RAG, long-context LLMs, and agentic memory) and their underlying causes. AMemGym not only enables effective selection among competing approaches but can also drive the self-evolution of memory-management strategies. By bridging structured state evolution with free-form interaction, our framework provides a scalable, diagnostically rich environment for advancing memory capabilities in conversational agents.