WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluation methods cannot localize memory failures in multimodal agents on long-horizon tasks to a specific stage of the memory lifecycle (encoding, retention, retrieval, or utilization), and they offer no systematic comparison between hand-designed and self-managed memory mechanisms. This work formulates agent memory as a lifecycle grounded in agent–environment interaction and introduces WorldMemArena, a benchmark of 400 multi-session tasks annotated with gold memory points, updates, distractors, and evidence chains. Built on real-world observations and action trajectories, the benchmark supports stage-level evaluation of long-context, retrieval-augmented generation, external memory, and self-managed memory systems. Experiments show that better memory encoding and storage do not necessarily improve overall performance; multimodal memory systems often fail to make full use of visual evidence; existing systems are unstable across domains and degrade on real trajectories; and self-managed memory is more flexible but costs more compute and is less reliable.

📝 Abstract

Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or use. The rise of agent harnesses that author their own memory sharpens this gap, since we have no principled way to compare hand-designed pipelines with self-managing alternatives. To close these gaps, we formulate multimodal agent memory as an Action-World Interaction Loop with an observable four-stage lifecycle, and instantiate it in WorldMemArena: 400 multi-session multimodal tasks spanning Lifelong Evolution (evolving personal and task states) and Agentic Execution (memory from real observations, actions, and feedback), annotated with gold memory points, updates, distractors, and evidence chains for stage-level diagnosis. This enables the first head-to-head comparison of long-context, manually designed (RAG and external memory systems), and harness-based memory agents. Results show that: (1) better memory writing and storage do not guarantee better performance; (2) multimodal memory still struggles to fully use visual evidence; (3) systems are unstable across domains and degrade on realistic agentic trajectories; and (4) harness memory is more flexible but remains costly and less reliable.
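To make the idea of stage-level diagnosis concrete, here is a minimal Python sketch of how a failure might be attributed to one of the four lifecycle stages named in the abstract (writing, maintenance, retrieval, use). It is an illustration only, not the paper's scoring procedure: the `StageTrace` fields, the `first_failing_stage` function, and the memory-point IDs are all hypothetical names, and the sketch assumes each question comes with a set of gold memory-point IDs forming its evidence chain.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class StageTrace:
    """What a memory system did for one question (all fields are hypothetical)."""
    stored_ids: Set[str]      # gold memory points present in the store after writing
    stale_ids: Set[str]       # stored points that needed an update but were not revised
    retrieved_ids: Set[str]   # memory points surfaced at decision time
    answer_correct: bool      # whether the final answer or action was right


def first_failing_stage(gold_evidence: Set[str], trace: StageTrace) -> Optional[str]:
    """Return the earliest lifecycle stage that lost gold evidence, or None."""
    # Writing: every gold evidence point must have been stored at all.
    if not gold_evidence <= trace.stored_ids:
        return "write"
    # Maintenance: none of the needed points may be left stale.
    if gold_evidence & trace.stale_ids:
        return "maintain"
    # Retrieval: every needed point must be surfaced when the decision is made.
    if not gold_evidence <= trace.retrieved_ids:
        return "retrieve"
    # Use: the evidence was available, but the agent still answered wrongly.
    if not trace.answer_correct:
        return "use"
    return None


trace = StageTrace(
    stored_ids={"m1", "m2"},
    stale_ids=set(),
    retrieved_ids={"m1"},
    answer_correct=False,
)
print(first_failing_stage({"m1", "m2"}, trace))
```

Here both evidence points were written and kept current, but only `m1` was retrieved, so the failure is attributed to retrieval rather than to the final answer. Attributing each failure to the earliest broken stage is one design choice among several; a benchmark could equally report a separate score per stage.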

Problem

Research questions and friction points this paper is trying to address.

multimodal agent memory

memory evaluation

action-world interaction

memory lifecycle

benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal agent memory

Action-World Interaction Loop

memory lifecycle

WorldMemArena

harness-based memory

🔎 Similar Papers

MMInA: Benchmarking Multihop Multimodal Internet Agents

2024-04-15 · arXiv.org · Citations: 22

MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents

2024-10-04 · International Conference on Learning Representations · Citations: 0