MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing evaluations struggle to assess whether multimodal agents retain the visual evidence they need for later reasoning; in particular, they rarely test fine-grained detail or changing visual states. This work proposes MemEye, a framework that makes visual evidence central to multimodal memory assessment. It constructs a new benchmark spanning eight everyday scenarios, organized along two dimensions: the granularity of the decisive visual evidence (from scene-level to pixel-level) and how retrieved evidence must be used (from single evidence to evolutionary synthesis). Using ablation-driven validation gates, multi-granularity annotations, and cross-temporal state tracking, the study evaluates 13 memory methods across four vision-language model backbones. Results show that current systems generally fail to preserve fine-grained details or to reason about state changes over time, underscoring the roles of evidence routing, temporal tracking, and detail extraction in long-term multimodal memory.

📝 Abstract

Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered using only captions or textual traces, so answers can be inferred without preserving the fine-grained visual evidence. Meanwhile, harder cases that require reasoning over changing visual states are largely absent. Therefore, we introduce MemEye, a framework that evaluates memory capabilities along two dimensions: one measures the granularity of decisive visual evidence (from scene-level to pixel-level evidence), and the other measures how retrieved evidence must be used (from single evidence to evolutionary synthesis). Under this framework, we construct a new benchmark across 8 life-scenario tasks, with ablation-driven validation gates for assessing answerability, shortcut resistance, visual necessity, and reasoning structure. By evaluating 13 memory methods across 4 VLM backbones, we show that current architectures still struggle to preserve fine-grained visual details and to reason about state changes over time. Our findings indicate that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.
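
The two axes and the ablation-driven gates described in the abstract can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the abstract names the axis endpoints and the gate names but not how each gate is computed, so the class names, the three ablation conditions, and the pass rules below are all hypothetical. The "reasoning structure" gate is left out because the abstract gives no hint of how it is checked.

```python
from dataclasses import dataclass
from enum import Enum


class Granularity(Enum):
    # Only the two endpoints are named in the abstract; any
    # intermediate levels are not listed there and are omitted here.
    SCENE = "scene-level"
    PIXEL = "pixel-level"


class Usage(Enum):
    # Endpoints of the "how retrieved evidence must be used" axis.
    SINGLE_EVIDENCE = "single evidence"
    EVOLUTIONARY_SYNTHESIS = "evolutionary synthesis"


@dataclass(frozen=True)
class AblationResult:
    """Hypothetical correctness flags for one question under three input conditions."""
    full_context: bool    # images and text history both available
    question_only: bool   # no history at all
    captions_only: bool   # images replaced by captions or textual traces


def passes_gates(result: AblationResult) -> dict[str, bool]:
    """Map ablation outcomes to gate verdicts (assumed rules, not the paper's)."""
    return {
        # The question must be solvable when all evidence is present.
        "answerability": result.full_context,
        # It must not be solvable from the question text alone.
        "shortcut_resistance": not result.question_only,
        # It must not be solvable once the images are replaced by text.
        "visual_necessity": not result.captions_only,
    }


def keep_question(result: AblationResult) -> bool:
    """Keep a candidate question only if every assumed gate passes."""
    return all(passes_gates(result).values())
```

Under these assumed rules, a question that is still answered correctly from captions alone would fail the visual-necessity gate and be dropped, which matches the concern the abstract raises about prior benchmarks.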

Problem

Research questions and friction points this paper is trying to address.

multimodal agent memory

visual evidence preservation

long-term memory evaluation

visual reasoning

state change reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual-centric evaluation

multimodal agent memory

evidence granularity