EMemBench: Interactive Benchmarking of Episodic Memory for VLM Agents

πŸ“… 2026-01-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the lack of effective benchmarks for evaluating episodic memory in vision-language model (VLM) agents within interactive environments. We propose the first interactive episodic memory evaluation framework that dynamically generates agent-trajectory-based, skill-balanced, and automatically verifiable questions through procedural templates in both textual and visual modalities. Ground-truth answers are automatically annotated using underlying game signals. Experiments across 15 text-based games and multiple visual environments reveal that inductive and spatial reasoning remain key bottlenecks; while persistent memory mechanisms benefit text-based agents, they yield limited gains for VLMs. Human evaluations confirm the benchmark’s challenge level and validity, establishing it as a robust tool for assessing episodic memory in interactive multimodal agents.

πŸ“ Abstract
We introduce EMemBench, a programmatic benchmark for evaluating the long-term memory of agents through interactive games. Rather than using a fixed set of questions, EMemBench generates questions from each agent's own trajectory, covering both text and visual game environments. Each template computes verifiable ground truth from underlying game signals, with controlled answerability and balanced coverage over memory skills: single/multi-hop recall, induction, temporal, spatial, logical, and adversarial. We evaluate memory agents with strong LMs/VLMs as backbones, using in-context prompting as baselines. Across 15 text games and multiple visual seeds, results are far from saturated: induction and spatial reasoning are persistent bottlenecks, especially in the visual setting. Persistent memory yields clear gains for open backbones on text games, but improvements are less consistent for VLM agents, suggesting that visually grounded episodic memory remains an open challenge. A human study further confirms the difficulty of EMemBench.
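The abstract describes a pipeline in which procedural templates are instantiated from the agent's own trajectory and the ground-truth answer is read off the underlying game signals. A minimal sketch of that idea, assuming a hypothetical `Event` record and a single-hop recall template (all names here are illustrative, not the paper's actual API):

```python
# Hypothetical sketch of trajectory-based question generation in the spirit of
# EMemBench. `Event` and `make_recall_question` are illustrative names only.
from dataclasses import dataclass

@dataclass
class Event:
    step: int          # time step in the agent's trajectory
    location: str      # room where the event occurred (an underlying game signal)
    item: str          # object the agent interacted with ("" if none)

def make_recall_question(trajectory):
    """Instantiate a single-hop recall template from the agent's own trajectory.

    The ground-truth answer is computed directly from logged game signals, so it
    is automatically verifiable, and the template is only instantiated when the
    trajectory actually contains the queried event (controlled answerability).
    """
    for ev in trajectory:
        if ev.item:  # answerable only if an item interaction was logged
            question = f"In which location did you first interact with the {ev.item}?"
            return question, ev.location
    return None, None  # no answerable instantiation of this skill

trajectory = [
    Event(step=1, location="kitchen", item=""),
    Event(step=2, location="garden", item="key"),
    Event(step=5, location="attic", item="lamp"),
]
question, answer = make_recall_question(trajectory)
# answer == "garden"
```

Because every question is derived from the agent's own episode rather than a fixed pool, two agents with different trajectories receive different (but equally verifiable) questions, which is what makes the evaluation interactive.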
Problem

Research questions and friction points this paper is trying to address.

episodic memory
vision-language models
interactive benchmarking
long-term memory
memory evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

episodic memory
interactive benchmarking
vision-language models
memory evaluation
trajectory-based questioning