MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of systematic benchmarks for evaluating the long-term memory of large vision-language models (LVLMs) and memory-augmented agents in multi-turn, multi-session, multimodal settings. To this end, we introduce MEMLENS, a benchmark covering five memory abilities and four context lengths, with context length measured under a cross-modal token-counting scheme. An image-ablation study shows that visual evidence is necessary to answer the questions. Evaluations of 27 LVLMs and 7 memory agents show that LVLM accuracy degrades as context length grows, that memory agents lose visual fidelity under compression, and that most systems score below 30% on multi-session reasoning. These findings point to the need for architectures that combine both paradigms.

📝 Abstract

Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark conducts a systematic comparison of the two on questions that genuinely require multimodal evidence. To close this gap, we introduce MEMLENS, a comprehensive benchmark for memory in multimodal multi-session conversations, comprising 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal) at four standard context lengths (32K-256K tokens) under a cross-modal token-counting scheme. An image-ablation study confirms that solving MEMLENS requires visual evidence: removing evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose evidence includes images. Evaluating 27 LVLMs and 7 memory-augmented agents, we find that long-context LVLMs achieve high short-context accuracy through direct visual grounding but degrade as conversations grow, whereas memory agents are length-stable but lose visual fidelity under storage-time compression. Multi-session reasoning caps most systems below 30%, and neither approach alone solves the task. These results motivate hybrid architectures that combine long-context attention with structured multimodal retrieval. Our code is available at https://github.com/xrenaf/MEMLENS.
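
To make the idea of a cross-modal token count concrete, here is a minimal Python sketch of how a conversation might be padded up to a target context length when both text and images consume budget. It is not the MEMLENS implementation: the abstract does not describe the scheme, so the flat per-image cost (`TOKENS_PER_IMAGE`), the characters-per-token heuristic (`CHARS_PER_TOKEN`), the `Turn` and `Session` types, and the `pack_sessions` helper are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field

# Assumed constants, not taken from the paper.
TOKENS_PER_IMAGE = 576   # hypothetical flat token cost charged per image
CHARS_PER_TOKEN = 4      # rough heuristic for English text


@dataclass
class Turn:
    text: str
    num_images: int = 0


@dataclass
class Session:
    turns: list[Turn] = field(default_factory=list)


def count_text_tokens(text: str) -> int:
    """Approximate text tokens by ceiling-dividing the character count."""
    return -(-len(text) // CHARS_PER_TOKEN)


def count_turn_tokens(turn: Turn) -> int:
    """Cross-modal count: text tokens plus a flat cost for each image."""
    return count_text_tokens(turn.text) + turn.num_images * TOKENS_PER_IMAGE


def count_session_tokens(session: Session) -> int:
    return sum(count_turn_tokens(t) for t in session.turns)


def pack_sessions(evidence: list[Session],
                  distractors: list[Session],
                  budget: int) -> tuple[list[Session], int]:
    """Keep every evidence session, then append distractor sessions in
    order until the next one would exceed the token budget."""
    used = sum(count_session_tokens(s) for s in evidence)
    if used > budget:
        raise ValueError("evidence sessions alone exceed the token budget")
    chosen = list(evidence)
    for session in distractors:
        cost = count_session_tokens(session)
        if used + cost > budget:
            break
        chosen.append(session)
        used += cost
    return chosen, used
```

Under these assumptions, the same evidence sessions could be packed at several budgets to produce the different context lengths, so that any accuracy gap between lengths reflects the added distractor context and not a change in the evidence itself.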

Problem

Research questions and friction points this paper is trying to address.

multimodal memory

vision-language models

long-term memory

benchmarking

multi-session reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal memory

vision-language models

long-context modeling