🤖 AI Summary
Existing evaluations of LLM memory focus mainly on retaining static information across multi-turn dialogues. They do not capture an agent's ability to build and use memory dynamically over extended tasks, so the resulting memory systems generalize poorly to real-world settings such as programming and web interaction. This work introduces MemGym, a benchmark that assesses long-horizon memory through a unified memory-reasoning interface across four task categories: tool use, deep search, programming, and computer operation. It reports memory-isolated scores that separate memory performance from reasoning, retrieval, and tool invocation, so diverse memory strategies can be ranked without those confounders. The benchmark includes two synthetic evaluation pipelines, MEMGYM-DR and MEMGYM-CODEQA, and MemRM, a lightweight memory reward model built by fine-tuning Qwen3-1.7B with QLoRA. MemRM makes evaluation on coding environments tractable by replacing full Docker rollouts with fast scalar scores of compression quality.
📝 Abstract
Memory is a central capability for LLM agents operating across long-horizon tasks. Existing memory benchmarks predominantly evaluate retention of personalized information in multi-turn chat scenarios, overlooking the dynamic memory formation that occurs during extended agent execution. Consequently, the memory systems they produce transfer poorly to realistic agentic environments, such as coding and web navigation. We present MemGym, a benchmark for agentic memory that unifies existing agent gyms and in-house memory-grounded pipelines behind one memory-reasoning interface. MemGym spans five evaluation tracks grouped into four agentic regimes: tool-use dialogue (tau2-bench), multi-turn deep-research search (MEMGYM-DR), coding (SWE-Gym and MEMGYM-CODEQA), and computer use (WebArena-Infinity). MemGym reports memory-isolated scores that decouple memory performance from reasoning, retrieval, and tool-use ability, so memory strategies can be ranked without those confounders. Our synthetic pipelines for MEMGYM-CODEQA and MEMGYM-DR are length-controllable, ablation-verified at every stage, and tightly aligned with downstream scenarios. To make evaluation on coding environments academically tractable, we train MemRM, a lightweight reward model (Qwen3-1.7B fine-tuned with QLoRA) that scores compression quality as a fast scalar read in place of full Docker rollouts.